Abstract — A wide variety of privacy metrics have been proposed
in the literature to evaluate the level of protection offered
by privacy-enhancing technologies. Most of these metrics are
specific to concrete systems and adversarial models, and are
difficult to generalize or translate to other contexts.
Furthermore, a better understanding of the relationships between the
different privacy metrics is needed to enable a more grounded and
systematic approach to measuring privacy, as well as to assist
system designers in selecting the most appropriate metric for a
given application.

In this work we propose a theoretical framework for privacy- 
preserving systems, endowed with a general definition of privacy 
in terms of the estimation error incurred by an attacker who aims 
to disclose the private information that the system is designed to 
conceal. We show that our framework permits interpreting and 
comparing a number of well-known metrics under a common per- 
spective. The arguments behind these interpretations are based 
on fundamental results related to the theories of information, 
probability and Bayes decision. 

Index Terms — Privacy, criteria, metrics, estimation, Bayes 
decision theory, statistical disclosure control, anonymous-com- 
munication systems, location-based services. 

I. Introduction 

The widespread use of information and communication tech- 
nologies to conduct all kinds of activities has in recent years 
raised privacy concerns. There is a wide diversity of applica- 
tions with a potential privacy impact, from social networking 
platforms to e-commerce or mobile phone applications. 

A variety of privacy-enhancing technologies (PETs) have 
been proposed to enable the provision of new services and 
functionalities while mitigating potential privacy threats. The 
privacy concerns arising in different applications are diverse, 
and so are the corresponding privacy-enhanced solutions that 
address these concerns. Similarly, various ad hoc privacy 
metrics have been proposed in the literature to evaluate the 
effectiveness of PETs. However, the relationships between these
different metrics have not been investigated in depth, which
leads to a fragmented understanding of how privacy
properties can be measured.

In this paper we consider a general, theoretical frame- 
work for privacy-preserving systems and propose using the 
attacker's estimation error as a privacy metric. We show that
the most widely used privacy metrics, such as k-anonymity,
l-diversity, t-closeness, ε-differential privacy, as well as
information-theoretic metrics such as Shannon's entropy,
min-entropy, or mutual information, may be construed as particular
cases of the estimation error.

Privacy metrics, accompanied by utility metrics, provide
a quantitative means of comparing the suitability of two or 
more privacy-enhancing mechanisms, in terms of the privacy- 
utility trade-off posed. Ultimately, such metrics will enable 
us to systematically build privacy-aware information systems 
by formulating design decisions as optimization problems, 
solvable theoretically or numerically, capitalizing on a rich 
variety of mature ideas and powerful techniques from the wide 
field of optimization engineering. 

We illustrate how the general model can be instantiated 
in three very different areas of application, namely statistical 
disclosure control, anonymous communications and location-based
services. Statistical disclosure control (SDC) [1] is the
research area that deals with the inherent compromise between 
protecting the privacy of the individuals in a microdata set 
and ensuring that those data are still useful for researchers. 
Traditionally, institutes and governmental statistical agencies 
have systematically gathered information about individuals 
with the aim of distributing those data to the research
community [2]. However, the distribution of this information should
not compromise respondents' privacy in the sense of revealing 
information about specific individuals. Motivated by this, con- 
siderable research effort has been devoted to the development 
of privacy -protecting mechanisms l^^. 01, 0, 0, 13 to be 
applied to the microdata sets before their release. In essence, 
these mechanisms rely upon some form of perturbation that 
permits enhancing privacy to a certain extent, at the cost of 
losing some of the data utility with respect to the unperturbed 
version. 

With the aim of assessing the effectiveness of such mech- 
anisms, numerous privacy metrics have been investigated. 
Probably, the best-known privacy metric is k-anonymity, which 
was first proposed in (H, Igl. In an attempt to address the 
limitations of this proposal, various extensions and enhancements
were introduced later in [10], [11], [12], [13], [14], [15].
While all these proposals have contributed to some extent to 
the understanding of the privacy requirements of this field, the 
SDC research community would undoubtedly benefit from the 
existence of a rule that could help them decide which privacy 
metric is the most appropriate for a particular application. 
In other words, there is a need for the establishment of a 
framework that enables us to compare those metrics and to 
formulate them by using a common, general definition of 
privacy. 

In anonymous communications, one of the goals is to 
conceal who talks to whom against an adversary who observes 
the inputs and outputs of the anonymous communication
channel. Mixes [16], [17], [18] are a basic building block
for implementing anonymous communication channels. Mixes 
perform cryptographic operations on messages such that it is 
not possible to correlate their inputs and outputs based on their 
bit patterns. In addition, mixes delay and reorder messages 
to hinder the linking of inputs and outputs based on timing 
information. Delaying messages has an impact on the usability 
of the system, and therefore imposes a cost on the system. 
On the other hand, higher delays allow for stronger levels of 
privacy. There is thus a trade-off between delay (cost) and
anonymity (privacy). Optimizing the level of anonymity
for a given expected delay makes it possible to extract as much
protection as possible from the anonymous channel at the
lowest possible cost.

Finally, we approach the particularly rich, important
example of location-based services (LBSs), where users submit
queries along with the location to which those queries refer. 
An example would be the query "Where is the nearest Italian 
restaurant?", accompanied by the geographic coordinates of 
the user's current location. In this scenario, a wide range of 
approaches have been proposed, many of them based on an 
intelligent perturbation of the user coordinates submitted to 
the provider [19]. Essentially, users may contact an untrusted
LBS provider directly, perturbing their location information 
so as to hinder providers in their efforts to compromise user 
privacy in terms of location, although clearly not in terms of 
query contents and activity, and at the cost of an inaccurate 
answer. In a nutshell, this approach presents again the inherent 
trade-off between data utility and privacy common to any 
perturbative privacy method. 

The survey of privacy metrics, the detailed analysis of 
their connection with information theory, and the mathematical 
unification as an attacker's estimation error presented in this 
paper shed new light on the understanding of those metrics and 
their suitability when it comes to applying them to specific 
scenarios. In regard to this aspect, two sections are devoted 
to the classification of several privacy metrics, showing the 
relationships with our proposal and the correspondence with 
assumptions on the attacker's strategy. While the former 
section approaches this from a theoretical perspective, the 
latter is written as a guide to help system designers choose 
the appropriate metrics, without having to delve into the 
mathematical details. We also hope to illustrate the riveting 
intersection between the fields of information privacy and 
information theory, in an attempt towards bridging the gap 
between the respective communities. 

II. Related work 

In this section we provide an overview of privacy metrics with 
an emphasis on those used in the three applications under 
study: anonymous communications, location-based services 
and statistical disclosure control. 

A. Anonymous-Communication Systems and Location-Based 
Services 

Mixes were proposed by Chaum [16] in 1981, and are a basic
building block for implementing high-latency anonymous
communications. A mix takes a number of input messages,
and outputs them in such a way that it is infeasible to link
an output to its corresponding input. In order to achieve this
goal, the mix changes the appearance (by encrypting and
padding messages) and the flow of messages (by delaying
and reordering them). Mixmaster [17] and Mixminion [18] are
more advanced versions of the Chaumian mix [16], and they
have been deployed to provide anonymous email services.

Several metrics have been proposed in the literature to 
assess the level of anonymity provided by anonymous-communication
systems (ACSs). Reiter and Rubin [20] define
the degree of anonymity as a probability 1 - p, where p
is the probability assigned by an attacker to the potential
initiators of a communication. In this model, users are more
anonymous as they appear (towards a certain adversary) to
be less likely to have sent a message, and the metric
is thus computed individually for each user and for each
communication. Berthold et al. [21], on the other hand, define
the degree of anonymity as the binary logarithm of the number 
of users of the system, which may be regarded as a Hartley 
entropy. This metric only depends on the number of users of 
the system, and does not take into account that some users 
might appear as more likely senders of a message than others. 

Information-theoretic anonymity metrics were independently
proposed in two papers. The metric proposed by
Serjantov and Danezis [22] uses Shannon's entropy as a measure of
the effective anonymity set size. The metric proposed by Diaz
et al. [23] normalizes Shannon's entropy to obtain a degree of
anonymity on a scale from 0 to 1.

Toth et al. [24] argue that Shannon entropy may not provide
relevant information to some users, as it considers the average
instead of the worst-case scenario for a particular user. They
suggest using instead a local anonymity measure computed
from min-entropy and max-entropy. Clauss and Schiffner [25]
proposed Rényi entropy as a generalization of Shannon-, min-
and max-entropy-based anonymity metrics.
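
As an illustration of the entropy-based metrics surveyed above, the following minimal Python sketch computes them for an attacker's probability assignment over candidate senders. The function names and the two example distributions are ours and purely illustrative; the sketch assumes the distribution is given as a list of probabilities that sum to one, and it identifies max-entropy with the Hartley entropy of the support.

```python
import math

def shannon_entropy(pmf):
    """Shannon entropy in bits (effective anonymity set size, in log scale)."""
    return -sum(q * math.log2(q) for q in pmf if q > 0)

def hartley_entropy(pmf):
    """Binary logarithm of the number of candidates with nonzero probability."""
    return math.log2(sum(1 for q in pmf if q > 0))

def min_entropy(pmf):
    """Worst-case measure: -log2 of the probability of the likeliest candidate."""
    return -math.log2(max(pmf))

def renyi_entropy(pmf, alpha):
    """Renyi entropy of order alpha in bits, for alpha >= 0 and alpha != 1."""
    return math.log2(sum(q ** alpha for q in pmf if q > 0)) / (1 - alpha)

def degree_of_anonymity(pmf):
    """Shannon entropy normalized by its maximum, log2 of the number of users."""
    return shannon_entropy(pmf) / math.log2(len(pmf))

# Hypothetical attacker beliefs over four potential senders.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]

for name, pmf in (("uniform", uniform), ("skewed", skewed)):
    print(name, hartley_entropy(pmf), shannon_entropy(pmf),
          min_entropy(pmf), degree_of_anonymity(pmf))
```

For the skewed distribution, the Hartley entropy still reports two bits, whereas the Shannon entropy and, more markedly, the min-entropy decrease, which reflects the criticism that counting users ignores how likely each of them is.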

Other anonymity metrics in the literature include possibilistic
(instead of probabilistic) approaches, such as those
proposed by Syverson and Stubblebine [26], Mauw et al. [27],
or Feigenbaum et al. [28]. According to these metrics, subjects
are considered anonymous if the adversary cannot determine
their actions with absolute certainty. Finally, Edman et al. [29]
propose a combinatorial anonymity metric that measures the
amount of information needed to reveal the full set of
relationships between the inputs and the outputs of a mix. Some
extensions of this model were proposed by Gierlichs et al. [30]
and by Bagai et al. [31].

Having examined the most relevant metrics in the field of 
anonymous communications, now we briefly touch upon some 
of the proposals intended for the scenario of LBS. Particularly, 
the issue of quantifying privacy in this scenario has been 
explored in [32] and revisited shortly afterwards in [33]. At a
conceptual level, we encounter the same underlying principle
proposed here, in the sense that the authors propose to measure
privacy as the adversary's expected estimation error for that
particular context. We shall discuss later in Sec. VI-A that their
specific metric for LBS may be construed as an illustrative
special case of our own work, and describe notable differences
with respect to our generic framework.

B. Privacy Criteria in Statistical Disclosure Control 

In statistical disclosure control terminology, a microdata set is 
a database whose records contain information at the level of in- 
dividual respondents. In those databases, each row corresponds 
to an individual and each column, to an attribute. According to 
the nature of attributes, we may classify them into identifiers, 
key attributes or quasi-identifiers, or confidential attributes. 
On the one hand, identifiers allow individuals to be unequivocally
identified. This is the case, for example, of social
security numbers or full names, which would be removed
before the publication of the microdata set. On the other hand,
key attributes are those attributes that, in combination, may be 
linked with external information to reidentify the respondent 
to whom the records in the microdata set refer. Last but not 
least, confidential attributes contain sensitive information on 
the respondents, such as health condition, political affiliation, 
religion or salary. 

k-Anonymity (H, is the requirement that each tuple of 
key attribute values be shared by at least k records in the 
database. This condition is illustrated in Fig. 1, where a
microdata set is k-anonymized before being published. Particularly,
this privacy criterion is enforced by using generalization and
suppression, two mechanisms by which key attribute values
are respectively coarsened and eliminated. As a result, all key
attribute values within each group are replaced by a common
tuple, and thus a record cannot be unambiguously linked to
any public database containing identifiers. Consequently, k- 
anonymity is said to protect microdata against linking attacks. 
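
The k-anonymity requirement can be checked mechanically. The following minimal Python sketch assumes that a released table is represented as a list of dictionaries; the attribute names, the records and the function name are hypothetical and serve only to illustrate the definition.

```python
from collections import Counter

def is_k_anonymous(records, key_attributes, k):
    """True if every tuple of key-attribute values is shared by at least k records."""
    counts = Counter(tuple(r[a] for a in key_attributes) for r in records)
    return all(c >= k for c in counts.values())

# Hypothetical released table after generalization and suppression.
released = [
    {"age": "30-39", "zip": "080**", "condition": "hepatitis"},
    {"age": "30-39", "zip": "080**", "condition": "hepatitis"},
    {"age": "40-49", "zip": "081**", "condition": "flu"},
    {"age": "40-49", "zip": "081**", "condition": "asthma"},
    {"age": "40-49", "zip": "081**", "condition": "flu"},
]

print(is_k_anonymous(released, ("age", "zip"), 2))
print(is_k_anonymous(released, ("age", "zip"), 3))
```

In this toy table the smallest group of records sharing the same key attribute values has two members, so the table is 2-anonymous but not 3-anonymous.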

Unfortunately, while this criterion prevents identity disclo- 
sure, it may fail against the disclosure of the confidential 
attribute. Concretely, suppose that a privacy attacker knows 
Emmanuel's key attribute values. If the attacker learns that
he is included in the released table depicted in Fig. 1(b),
then the attacker may conclude that this patient suffers from
hepatitis even though the attacker is unable to ascertain which
record belongs to this individual. This is known as a similarity
attack, meaning that values of confidential attributes may
still be semantically similar. More generally, the skewness
attack exploits the difference between the prior distribution of 
confidential attributes in the whole data set and the posterior 
distribution of those attributes within a specific group. 

All these vulnerabilities motivated the appearance of a 
number of proposals, some of which we now overview. An
enhancement of k-anonymity called p-sensitive k-anonymity [10]
incorporates the additional restriction that there be at least p
distinct values for each confidential attribute within each
k-anonymous group. With the aim of addressing the data utility
loss incurred by large values of p, l-diversity [11] proposes
instead that there be at least l "well-represented" values for
each confidential attribute. Unfortunately, both proposals are
still vulnerable to similarity attacks and skewness attacks. 
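
The additional restriction imposed by p-sensitive k-anonymity can be sketched in the same spirit. The Python code below assumes a single confidential attribute and a table represented as a list of dictionaries; the records, the attribute names and the function names are hypothetical. Counting distinct values is only the simplest reading of "well-represented", so the sketch should not be taken as a full implementation of l-diversity.

```python
from collections import defaultdict

def distinct_confidential_values(records, key_attributes, confidential):
    """Number of distinct confidential values within each group of key-attribute values."""
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[a] for a in key_attributes)].add(r[confidential])
    return {group: len(values) for group, values in groups.items()}

def is_p_sensitive(records, key_attributes, confidential, p):
    """True if every group contains at least p distinct confidential values."""
    counts = distinct_confidential_values(records, key_attributes, confidential)
    return all(n >= p for n in counts.values())

# Hypothetical 2-anonymous table; the first group is homogeneous.
released = [
    {"age": "30-39", "zip": "080**", "condition": "hepatitis"},
    {"age": "30-39", "zip": "080**", "condition": "hepatitis"},
    {"age": "40-49", "zip": "081**", "condition": "flu"},
    {"age": "40-49", "zip": "081**", "condition": "asthma"},
    {"age": "40-49", "zip": "081**", "condition": "flu"},
]

print(distinct_confidential_values(released, ("age", "zip"), "condition"))
print(is_p_sensitive(released, ("age", "zip"), "condition", 2))
```

The first group contains a single confidential value, so the table fails the check for p = 2 even though it is 2-anonymous.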

Fig. 1: Generalization and suppression of key attribute values to attain
k-anonymity.

In an attempt to overcome all these deficiencies, t-closeness [12]
was proposed. A microdata set satisfies t-closeness
if, for each group of records with the same tuple of perturbed
key attribute values, a measure of discrepancy between the
posterior and prior distributions does not exceed a threshold t.
Inspired by this measure, [15] defines an (average) privacy
risk as the conditional Kullback-Leibler (KL) divergence
between the posterior and the prior distributions, a measure
that may be regarded as an averaged version of t-closeness.
Further, this average privacy risk is shown to be equal to
the mutual information between the confidential attributes and
the observed, perturbed key attributes, and, finally, a
connection is established with Shannon's rate-distortion theory.
A related criterion named δ-disclosure is proposed in [13],
a yet stricter version that measures the maximum absolute
log ratio between the prior and the posterior distributions.
Lastly, [14] analyzes privacy for interactive databases, where
a randomized perturbation rule is applied to a true answer to a
query, before returning it to the user. Consider two databases
that differ only by one record, but are subject to a common
perturbation rule. Conceptually, the randomized perturbation
rule is said to satisfy the ε-differential privacy criterion if the
two corresponding probability distributions of the perturbed
answers are similar, according to a certain inequality. Later in
Sec. V-B we provide further details about these privacy criteria
and relate them in terms of our formulation.
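
The equality between the average privacy risk and the mutual information mentioned above can be verified numerically. The Python sketch below assumes a joint PMF over a confidential attribute value x and a group of perturbed key attribute values y, given as a dictionary; the toy distribution and all function names are ours. The last function evaluates the maximum absolute log ratio between posterior and prior, in the spirit of the criterion of [13], without claiming to reproduce its exact formulation.

```python
import math

def marginals(joint):
    """Marginal PMFs of X and Y from a joint PMF {(x, y): probability}."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return p_x, p_y

def average_privacy_risk(joint):
    """Conditional KL divergence between posterior and prior, in bits."""
    p_x, p_y = marginals(joint)
    risk = 0.0
    for (x, y), p in joint.items():
        if p > 0:
            posterior = p / p_y[y]
            risk += p * math.log2(posterior / p_x[x])
    return risk

def mutual_information(joint):
    """Mutual information I(X; Y) in bits."""
    p_x, p_y = marginals(joint)
    return sum(p * math.log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in joint.items() if p > 0)

def max_abs_log_ratio(joint):
    """Maximum over (x, y) of |ln(posterior / prior)|; infinite if a posterior is zero."""
    p_x, p_y = marginals(joint)
    worst = 0.0
    for x in p_x:
        for y in p_y:
            p = joint.get((x, y), 0.0)
            if p == 0:
                return math.inf
            worst = max(worst, abs(math.log(p / p_y[y] / p_x[x])))
    return worst

# Hypothetical joint PMF of a confidential value and a released group.
joint = {("flu", "g1"): 0.3, ("hepatitis", "g1"): 0.2,
         ("flu", "g2"): 0.1, ("hepatitis", "g2"): 0.4}

print(average_privacy_risk(joint), mutual_information(joint))
print(max_abs_log_ratio(joint))
```

Both quantities vanish when the confidential attribute and the released group are statistically independent, that is, when every posterior coincides with the prior.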

III. Preliminaries 

In this section, we shall present our convention regarding 
random variables (r.v.'s) and probability distributions. Next, 
we shall introduce some elementary concepts for those readers 
who are not familiar with Bayes decision theory (BDT). 

Throughout this paper, we shall follow the convention of
using uppercase letters to denote r.v.'s, and lowercase letters
to denote the particular values they take on. We shall call
the set of values an r.v. takes on its alphabet. Probability mass
functions (PMFs) are denoted by $p$, subindexed by the corresponding
r.v. Accordingly, $p_X(x)$ denotes the value of the function $p_X$
at $x$. Informally, we occasionally refer to the function $p_X$ as
$p_X(x)$. Similarly, we use the notations $p_{X|Y}$ and $p_{X|Y}(\cdot|y)$
equivalently. In addition, we shall follow the notation in [34]
to specify that two sequences $a_k$ and $b_k$ are approximately
equal in the exponent if $\lim_{k\to\infty} \frac{1}{k} \log \frac{a_k}{b_k} = 0$.

Having adopted this convention, we now recall the basics of
BDT. BDT is a statistical method that, fundamentally,
uses a probabilistic model to analyze the making of decisions
under uncertainty and the costs associated with those
decisions [35], [36]. In general, Bayes decision principles may be
formulated in the following terms. Suppose that the uncertainty
refers to an unknown parameter modeled by an r.v. $X$. In
decision-theoretic terminology, this is also known as the state of
nature. Let $Y$ be another r.v. modeling an observation or
measurement on the state of nature. Suppose that, given a
particular observation $y$, we are required to make a decision on
the unknown. Let $\hat{x}$ denote the estimator of $X$, that is, the rule
that provides a decision or estimate $\hat{x}(y)$ for every possible
observation $y$. Clearly, any decision will be accompanied
by a cost. This is captured by the loss function $d(x, \hat{x}(y))$,
which measures how costly the decision $\hat{x}(y)$ will be when
the unknown is $x$. However, since the actual loss incurred
by a decision cannot be calculated with absolute certainty
at the time the decision is made, BDT contemplates the
average loss associated with this decision. Concretely, the
Bayes conditional risk for an estimator $\hat{x}$ is defined in the
discrete case as

$$\mathcal{R}(y) = \mathrm{E}[d(X, \hat{x}(y)) \mid y] = \sum_{x} p_{X|Y}(x|y)\, d(x, \hat{x}(y)),$$

where the expectation is taken over the posterior probability
distribution $p_{X|Y}(\cdot|y)$. According to this, the Bayes risk
associated with that estimator is defined as the average of the
Bayes conditional risk over all possible observations $y$, that
is,



$$\mathcal{R} = \mathrm{E}\,\mathrm{E}[d(X, \hat{x}(Y)) \mid Y] = \sum_{x,y} p_{XY}(x,y)\, d(x, \hat{x}(y)),$$



where the expectation is additionally taken over the probability 
distribution of Y. Based on this definition, an estimator is 
called Bayes estimator or Bayes decision rule, if it minimizes 
the Bayes risk among all possible estimators. It turns out that 
this optimal estimator is precisely 

$$\hat{x}_{\mathrm{Bayes}}(y) = \arg\min_{\hat{x}} \mathrm{E}[d(X, \hat{x}) \mid y],$$

for all y; i.e., the Bayes estimator is the one that minimizes 
the Bayes conditional risk for every observation. 
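
To make these definitions concrete, the following minimal Python sketch computes the Bayes conditional risk, the Bayes estimate and the Bayes risk for a small joint PMF. The toy distribution, the squared-error loss chosen for the illustration and all function names are our own assumptions, and estimates are restricted to the alphabet of X.

```python
def posterior(joint, y):
    """Posterior PMF of X given Y = y, from a joint PMF {(x, y): probability}."""
    p_y = sum(p for (_, yy), p in joint.items() if yy == y)
    return {x: p / p_y for (x, yy), p in joint.items() if yy == y}

def conditional_risk(joint, loss, y, x_hat):
    """Bayes conditional risk: expected loss of the estimate x_hat given Y = y."""
    return sum(p * loss(x, x_hat) for x, p in posterior(joint, y).items())

def bayes_estimate(joint, loss, y, candidates):
    """Estimate that minimizes the Bayes conditional risk for the observation y."""
    return min(candidates, key=lambda x_hat: conditional_risk(joint, loss, y, x_hat))

def bayes_risk(joint, loss, estimator):
    """Expected loss of an estimator, averaged over the joint PMF of (X, Y)."""
    return sum(p * loss(x, estimator(y)) for (x, y), p in joint.items())

def squared_error(x, x_hat):
    return (x - x_hat) ** 2

# Hypothetical joint PMF over x in {0, 1, 2} and y in {"a", "b"}.
joint = {(0, "a"): 0.30, (1, "a"): 0.10, (2, "a"): 0.10,
         (0, "b"): 0.05, (1, "b"): 0.10, (2, "b"): 0.35}
alphabet = (0, 1, 2)

def bayes_rule(y):
    return bayes_estimate(joint, squared_error, y, alphabet)

print(bayes_rule("a"), bayes_rule("b"))
print(bayes_risk(joint, squared_error, bayes_rule))
```

Note that, for the observation "a", the estimate under squared error is 1 even though 0 is the most likely value, because the loss penalizes large deviations.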

Once some of the basic elements in Bayes analysis have
been examined, we would like to establish a connection
between the maximum a posteriori (MAP) estimator and the Bayes
estimator. With this aim, first recall that a MAP estimator, as
the name implies, is the estimator that maximizes the posterior
distribution. Now consider the loss function $d(x, \hat{x})$ to be the
Hamming distance between $x$ and $\hat{x}$, which is an indicator
function, and recall that the expectation of an indicator r.v. is
the probability of the event it is based on. Mathematically,

$$\mathrm{E}[d_{\mathrm{Hamming}}(X, \hat{x}) \mid y] = \mathrm{P}\{X \neq \hat{x} \mid y\},$$

and consequently, 

$$\hat{x}_{\mathrm{MAP}}(y) = \arg\min_{\hat{x}} \mathrm{P}\{X \neq \hat{x} \mid y\} = \arg\max_{\hat{x}} \mathrm{P}\{X = \hat{x} \mid y\}. \tag{1}$$



In conclusion, Bayes and MAP estimators coincide when the 
loss function is Hamming distance. 
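
For completeness, the intermediate step behind this coincidence may be written out as follows. The indicator notation $\mathbb{1}\{\cdot\}$ is ours; the remaining symbols follow the convention of this section.

```latex
\begin{align*}
\mathrm{E}[d_{\mathrm{Hamming}}(X,\hat{x}) \mid y]
  &= \sum_{x} p_{X|Y}(x|y)\, \mathbb{1}\{x \neq \hat{x}\} \\
  &= \sum_{x \neq \hat{x}} p_{X|Y}(x|y) \\
  &= 1 - p_{X|Y}(\hat{x}|y).
\end{align*}
```

Minimizing the left-hand side over $\hat{x}$ is therefore the same as maximizing the posterior probability $p_{X|Y}(\hat{x}|y)$.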

IV. Measuring Privacy as an Attacker's 
Estimation Error 

This section presents our first contribution, a general
framework that lays the foundation for the establishment of a unified
measurement of privacy. However, it is not until Sec. V
that we shall show that a number of privacy criteria may
be regarded as particular cases of our proposal. First,
Sec. IV-A introduces our notation. Next, Sec. IV-B describes
the adversarial model. In Sec. IV-C we present our privacy
metric, and finally, in Sec. IV-D we illustrate the proposed
formulation with a simple but insightful example.

TABLE I: Simplified representation of our notation.

A. Mathematical Assumptions and Notation 

In this section, we provide the notation that we shall use 
throughout this work. To this end, we first introduce the key 
actors of the proposed framework: 

• a user, who wishes to protect their privacy; 

• a (trusted) system, to which each user entrusts their 
private data for its protection; the unique purpose of 
this entity is to guarantee the privacy of the user, and 
with this aim, the system may use any privacy-preserving 
mechanism at its disposal; 

• and an attacker, who strives to disclose private informa- 
tion about this user. 

To clarify the elements involved in our framework, consider
a conceptually simple approach to anonymous Web browsing,
consisting of a trusted third party (TTP) acting as an intermediary between Internet
users and Web servers. From the perspective of our model, the 
users would be those subscribed to the anonymous proxy; the 
system would be this proxy; and the attackers those servers 
that attempt to compromise users' privacy from their Web 
browsing activity. 

In the following, the term r.v. is used with full generality 
to include categorical or numerical data, vectors, tuples or se- 
quences of mixed components, but for mathematical simplicity 
we shall henceforth assume that all r.v.'s in the paper have 
finite alphabets. 

• The attacker's unknown or uncertainty is denoted by the 
r.v. X, which models the private information about a user 
that the attacker wishes to ascertain. 

• The system's input is represented by the r.v. $X'$ and refers
to the user's data required by the system to make a decision.

• The system's decision is modeled by the r.v. $Y'$ and
denotes disclosed information, perhaps part of the system's
input or a perturbation of it.

• The attacker's input or observation is denoted by the
r.v. $Y$ and captures any evidence or measurement the
attacker has about the unknown. In some circumstances,
this observation may be directly the information disclosed
by the system, that is, $Y = Y'$. Put another way, the
only information available to the attacker is exactly that
revealed by the system. In other cases, the attacker's
input may be a perturbation of $Y'$, perhaps together with
background knowledge the attacker may have acquired.
In such cases, we have that $Y \neq Y'$.

• The attacker's decision is modeled by the r.v. $\hat{X}$ and
represents the attacker's estimate of $X$ from $Y$.

TABLE II: Description of the variables used in our notation in the special
case of SDC. Often, $X = X'$ and $Y = Y'$.

In order to clarify this notation, we provide an example 
in which the above variables are put in the context of SDC. 
In this scenario, the data publisher plays the role of the 
system. Concretely, X may represent identifying or confiden- 
tial attribute values the attacker endeavors to ascertain with 
regard to an individual appearing in a released table. The 
individuals contained in this table are what we call users. 
The system's input becomes now the key attribute values that 
the publisher has about the individuals. On the other hand, 
Y' is the perturbed version of those values, which jointly with 
the (unperturbed) confidential attribute values, constitute the 
released table. Furthermore, the attacker's input consists of 
the released table and, possibly, background knowledge the 
privacy attacker may have. For example, this could be the case 
of a voter registration list. Finally, the attacker's decision
is the estimate of X. All this information is shown in Table II.

Similarly, now we specify the variables of our framework 
in the special case of a mix. Under this scenario, the mix 
represents the system, whose objective is to hide the cor- 
respondence between the incoming and outgoing messages. 
Precisely, the attacker's uncertainty is this correspondence. 
The system's input and system's decision are the arrival 
and departure times of the messages, respectively. On the 
other hand, the information available to the attacker, i.e., the
attacker's observation $Y$, consists of $X'$, $Y'$ and the design
parameters of the mix. Finally, $\hat{X}$ is the attacker's decision on
the correspondence between the messages. This is depicted in
Fig. 2 and summarized in Table III.



B. Adversarial Model 

The consideration of a framework that encompasses a variety 
of privacy criteria necessarily requires the formalization of the 
attacker's model. In this spirit, we now proceed to present the 
parameters that characterize this model. 

Firstly, we shall contemplate an adversarial model in which
the attacker uses a Bayes (best) decision rule. Conceptually,
this corresponds to the estimation made by an attacker that
uses the available information optimally, as we formally
argued in Sec. III. Namely, for every possible decision of the
system resulting in an observation $y$, the attacker will make
a Bayes decision $\hat{x}(y)$ on $X$. With regard to this attacker's
decision rule, we would like to remark that, whereas
it is a deterministic estimator, the system's decision is assumed
to be a randomized perturbation rule given by $p_{Y'|X'}$. As
a consequence of this, it is clear that the system does not
leak any private information when deciding $Y'$, provided that
$Y'$ and $X'$ are statistically independent.

Fig. 2: Our framework is put in the context of mixes.

Secondly, as explained in Sec. III, we need to
evaluate the cost of each decision made by the attacker.
For this purpose, we consider the attacker's distortion
function $d_A(x, \hat{x}(y))$, which measures the degree of dissatisfaction
that the attacker experiences when $X = x$ and $\hat{X} = \hat{x}(y)$.
Similarly, we contemplate the system's distortion
function $d_S(x', y')$, which reflects the extent to which the system,
and therefore the user, is discontent when $Y' = y'$ and $X' = x'$.

A crucial distinction in the type of attacker's distortion
function $d_A$ considered will be whether it captures a sort of
geometry over the symbols of the alphabet, or not. The most
evident example of a distortion function that does not take
this geometry into account is the Hamming function, which we
already introduced at the end of Sec. III. Concretely, this binary
metric just indicates whether $x$ and $\hat{x}$ coincide, and provides
no more information about the discrepancy between them. On
the other hand, the squared error loss $d_A(x, \hat{x}) = (x - \hat{x})^2$
and the absolute error loss $d_A(x, \hat{x}) = |x - \hat{x}|$ are just two
commonly used examples of distortion functions that rely on
or induce a certain geometry.
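
The effect of this distinction on the attacker's decision can be illustrated with a short Python sketch. The posterior distribution over a numerical alphabet, as well as the function names, are hypothetical; the estimate is restricted to values of the alphabet, in line with the finite-alphabet assumption of this paper.

```python
def hamming(x, x_hat):
    """0 if the estimate is exactly right, 1 otherwise; ignores how far off it is."""
    return 0 if x == x_hat else 1

def squared_error(x, x_hat):
    return (x - x_hat) ** 2

def absolute_error(x, x_hat):
    return abs(x - x_hat)

def bayes_estimate(posterior, loss):
    """Value of the alphabet that minimizes the expected loss under the posterior."""
    def expected_loss(x_hat):
        return sum(p * loss(x, x_hat) for x, p in posterior.items())
    return min(posterior, key=expected_loss)

# Hypothetical posterior PMF of X given some observation, over {0, ..., 4}.
posterior = {0: 0.40, 1: 0.05, 2: 0.10, 3: 0.20, 4: 0.25}

print(bayes_estimate(posterior, hamming))
print(bayes_estimate(posterior, squared_error))
print(bayes_estimate(posterior, absolute_error))
```

Under the Hamming function the attacker picks the most likely symbol, whereas the two losses that induce a geometry lead to a value in the middle of the alphabet, close to the posterior mean and median.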

C. Definition of our Privacy Criterion 

Bearing in mind the above considerations, and consistently
with Sec. III, we define conditional privacy as

$$\mathcal{P}(y) = \mathrm{E}[d_A(X, \hat{x}(y)) \mid y], \tag{2}$$

which is the estimation error incurred by the attacker,
conditioned on the observation $y$. Based on this definition, we
contemplate two possible measures of privacy. In particular,
we define worst-case privacy as

$$\mathcal{P}_{\min} = \min_{y} \mathcal{P}(y). \tag{3}$$

On the other hand, we define average privacy as

$$\mathcal{P}_{\mathrm{avg}} = \mathrm{E}\,\mathcal{P}(Y) = \mathrm{E}\, d_A(X, \hat{x}(Y)), \tag{4}$$

which is the average of the conditional privacy over all
possible observations $y$.
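
The three measures just defined can be computed directly for a finite joint PMF. The Python sketch below assumes an attacker who applies a Bayes decision rule restricted to the alphabet of X, as in Sec. IV-B; the joint distribution and the function names are ours and purely illustrative.

```python
def observations(joint):
    """PMF of the attacker's observation Y, from a joint PMF {(x, y): probability}."""
    p_y = {}
    for (_, y), p in joint.items():
        p_y[y] = p_y.get(y, 0.0) + p
    return p_y

def conditional_privacy(joint, d_a, y, alphabet):
    """Attacker's minimum expected distortion given the observation y."""
    p_y = observations(joint)[y]
    post = {x: p / p_y for (x, yy), p in joint.items() if yy == y}
    return min(sum(p * d_a(x, x_hat) for x, p in post.items())
               for x_hat in alphabet)

def worst_case_privacy(joint, d_a, alphabet):
    """Minimum of the conditional privacy over all observations."""
    return min(conditional_privacy(joint, d_a, y, alphabet)
               for y in observations(joint))

def average_privacy(joint, d_a, alphabet):
    """Average of the conditional privacy over the PMF of the observation."""
    return sum(p * conditional_privacy(joint, d_a, y, alphabet)
               for y, p in observations(joint).items())

def hamming(x, x_hat):
    return 0 if x == x_hat else 1

# Hypothetical joint PMF of a binary unknown X and a binary observation Y.
joint = {(0, 0): 0.45, (1, 0): 0.05, (0, 1): 0.10, (1, 1): 0.40}

print(conditional_privacy(joint, hamming, 0, (0, 1)))
print(worst_case_privacy(joint, hamming, (0, 1)))
print(average_privacy(joint, hamming, (0, 1)))
```

In this toy case the observation 0 is the most revealing one, so the worst-case privacy is smaller than the average privacy.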

In order to measure the utility loss caused by the
perturbation of the original data, we define the average distortion
as

$$\mathcal{D} = \mathrm{E}\, d_S(X', Y'). \tag{5}$$

According to these definitions, a privacy-protecting system
and an attacker would adopt the following strategies. Namely,
the system would select the decision rule $p_{Y'|X'}$ that
maximizes either the average privacy or the worst-case privacy,
while not allowing the average distortion to exceed a certain
threshold. On the other hand, the attacker would choose the
Bayes estimator, which would lead to the minimization of both
measures of privacy. The reason behind this is that the Bayes
estimator also minimizes the conditional privacy, as stated in
Sec. III.

On a different note, we would like to remark that a privacy
risk $\mathcal{R}$ in lieu of $\mathcal{P}$ could be defined for $-d_A(x, \hat{x}(y))$ instead
of $d_A(x, \hat{x}(y))$. An analogous argument justifies the use of
utility instead of distortion.

Last but not least, we would also like to note that, in the 
special case when the unknown variable X models the identity 
of a user, our measure of privacy may be regarded, in fact, as 
a measure of anonymity. 

D. Example 

Next, we present a simple example that sheds some light on 
the formulation introduced in the previous sections. 

For the sake of simplicity, consider $X' = X$, that is, the
system's input is the confidential information that needs to be
protected. Suppose that $X$ is a binary r.v. with $P\{X = 0\} =
P\{X = 1\} = 1/2$. In order to hinder privacy attackers in their
efforts to ascertain $X$, for each possible outcome $x$, the system
will disclose a perturbed version $y'$. Namely, with probability $p$
the system will decide to reveal the complementary value of $x$,
whereas with probability $1 - p$ no perturbation will be applied,
i.e., $y' = x$. Note that, in this example, the system's decision
rule is completely determined by $p$, for which we conveniently
impose the condition $0 \le p < 1/2$.

At this point, we shall assume that the attacker only has
access to the disclosed information $Y'$, and therefore the
attacker's input $Y$ boils down to it. We anticipate that,
throughout this work, this supposition will be usual. In addition,
we shall consider the attacker's distortion function to be the
Hamming distance. As commented in Sec. III, this
implies that the Bayes estimator matches the MAP estimator.
According to this observation, and since $p < 1/2$ makes the
disclosed value the more likely one, it is easy to demonstrate that
the attacker's best decision is $\hat{X} = Y$. Therefore, the average
privacy (4) becomes

$$\mathcal{P}_{\mathrm{avg}} = P\{X \neq \hat{X}\} = P\{X \neq Y\} = P\{X \neq Y'\} = p.$$

On the other hand, if we suppose that the system's distortion
function is also the Hamming distance, from (5), it follows
that

$$\mathcal{D} = P\{X' \neq Y'\} = P\{X \neq Y'\} = p.$$

Based on these two results, we now proceed to describe the strategy that the system would follow. To this end, we define the average utility $\mathcal{U}$ as $1 - \mathcal{D}$. According to this, the system would strive to maximize the average privacy with respect to $p$, subject to the constraint $\mathcal{U} \ge u_0$. Fig. 3 illustrates this simple optimization problem by showing the trade-off curve between privacy and utility. In this example, it is straightforward to verify that the optimal value of average privacy is $\mathcal{P}_{\text{avg}}^{\max} = 1 - u_0$, for $1/2 < u_0 \le 1$.

Fig. 3: Representation of the trade-off curve between privacy and utility for the example.

Fig. 4: The arguments that lead to the interpretation of several privacy metrics as particular cases of our definition of privacy are conceptually organized in the above points.
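The binary example of Sec. IV-D can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function names `simulate_binary_example` and `optimal_privacy`, the number of trials and the seed are illustrative assumptions. It simulates a uniform bit that the system flips with probability p, lets the attacker guess the disclosed bit (the MAP decision when p < 1/2), and reports the empirical average privacy and distortion, both of which should be close to p.

```python
import random

def simulate_binary_example(p, trials=200_000, seed=0):
    """Empirical average privacy and distortion of the bit-flipping system."""
    rng = random.Random(seed)
    attacker_errors = 0
    system_errors = 0
    for _ in range(trials):
        x = rng.randint(0, 1)                  # uniform confidential bit X
        y = x ^ 1 if rng.random() < p else x   # system flips X with probability p
        x_hat = y                              # MAP estimate when p < 1/2
        attacker_errors += (x_hat != x)        # Hamming distortion of the attacker
        system_errors += (y != x)              # Hamming distortion of the system
    return attacker_errors / trials, system_errors / trials

def optimal_privacy(u0):
    """Largest average privacy p subject to utility 1 - p >= u0, for 1/2 < u0 <= 1."""
    if not 0.5 < u0 <= 1.0:
        raise ValueError("u0 must lie in (1/2, 1]")
    return 1.0 - u0

privacy, distortion = simulate_binary_example(p=0.2)
print(privacy, distortion)       # both close to p = 0.2
print(optimal_privacy(0.7))      # privacy attained at the utility constraint
```

Because the attacker's estimate coincides with the disclosed bit, the two empirical rates are identical by construction, which mirrors the equality of average privacy and average distortion in this example.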

V. Theoretical Analysis 

In this section, we present our second contribution, namely, 
the interpretation of several well-known privacy criteria as 
particular cases of our more general definition of privacy. The 
arguments behind the justification of these privacy metrics as 
a particularization of our criterion are based on numerous 
concepts from the fields of information theory, probability 
theory and BDT. In this section, we therefore approach this 
issue from a theoretical perspective; however, we refer those 
readers not particularly interested in the mathematical details 
to Sec. Eni 

For a comprehensive exposition of these arguments, the 
underlying assumptions and concepts will be expounded in 
a systematic manner, following the points sketched in Fig. 4.
As mentioned in Sec. IV-B and illustrated by the first branch
of the tree depicted in this figure, our starting point makes the 
significant distinction between attacker's distortion measures 
based on the Hamming distance and the rest, according to 
whether we wish to capture a certain, gradual measure of distance between alphabet values beyond sheer symbol equality. It is important to recall from Sec. III that in the case of a Hamming distortion measure, expected distortion boils down to probability of error, yielding a different class of estimation problems.



Bearing in mind the above remark, in Sec. V-A we shall contemplate the case when the attacker's distortion function is the Hamming distance, whereas in Sec. V-B we shall deal with the more general case in which $d_A$ can be any other distortion function. In the special case of Hamming distance, we consider two alternatives for the variables in Table I:
single-occurrence and multiple-occurrence data. The former 
case considers the variables to be tuples of a small number of 
components, and the latter case assumes that these variables 
are sequences of data. In the scenario of single-occurrence 
data, we shall establish a connection between Hartley's entropy 
and our privacy metric, which will allow us to interpret
k-anonymity, l-diversity and min-entropy criteria as particular
cases of our framework. The arguments that will enable us to 
justify this connection stem from MAP estimation, BDT and 
the concept of confidence set. On the other hand, when we 
consider multiple-occurrence data, we shall use the asymp- 
totic equipartition property (AEP) to argue that the Shannon 
entropy, as a measure of privacy, is a characterization of the 
cardinality of a high-confidence set of sequences. 

In the more general case in which the attacker's distortion 
function is not the Hamming distance, we shall explore two 
possible scenarios. On the one hand, we shall consider the 
case where this function is known to the system. Under the 
assumption of a Bayes attacker's strategy, we shall use BDT to 
justify the system's best decision rule. On the other hand, we 
shall contemplate the case in which the attacker's distortion 
function is unknown to the system. Specifically, this scenario 
will allow us to connect our framework to several privacy 
criteria through the concept of total variation, provided that 
the attacker uses MAP estimation. 



A. Hamming Distortion 

In this section, we shall analyze the special case when 
the attacker's distortion function is the Hamming distance, 
commented on in Secs. III and IV-B. In addition, we shall
contemplate two cases for the variables of our framework: 
single-occurrence and multiple-occurrence data. 

1) Single Occurrence: This section considers the scenario in which the variables defined in Sec. IV-A are tuples of a relatively small number of components, including both categorical
and numerical data, defined on a finite alphabet. In order to 
establish a connection between some of the most popular 
privacy metrics and our criterion, first we shall introduce 
the concept of confidence set and briefly recall a riveting 
generalization of Shannon's entropy. 

Consider an r.v. $X$ taking on values in the alphabet $\mathcal{X}$. A confidence set $\mathcal{C}$ with confidence $p$ is defined as a subset of $\mathcal{X}$ such that $P\{X \in \mathcal{C}\} = p$. In the case of continuous-valued random scalars, confidence sets commonly take the form of intervals. In these terms, it is clear that a privacy attacker aimed at ascertaining $X$ will benefit the most from those confidence sets whose cardinality is reduced substantially with respect to the original alphabet size, with high confidence. To connect the concept of confidence set to our interpretation of privacy as an attacker's estimation error, consider an attacker model where the attacker only takes into account the shape of the PMF of the unknown $X$ to identify a confidence set $\mathcal{C}$ for some desired confidence $p$, and beyond that, assumes all the included members equally relevant. This last assumption may be interpreted as an investigation on a tractable list of potential identities, carried out in parallel. MAP estimation within that set, considering it uniformly distributed, leads to an estimation error of $1 - 1/|\mathcal{C}|$, that is, a bijective function of its cardinality.
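The confidence-set attacker described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the helper names, the toy PMF over four identities, and the greedy rule (collect the most likely outcomes until the accumulated mass reaches at least the desired confidence, rather than exactly that value) are not taken from the original text.

```python
def smallest_confidence_set(pmf, confidence):
    """Greedily collect the most likely outcomes until their mass reaches `confidence`."""
    ranked = sorted(pmf.items(), key=lambda item: item[1], reverse=True)
    chosen, mass = [], 0.0
    for outcome, prob in ranked:
        if mass >= confidence:
            break
        chosen.append(outcome)
        mass += prob
    return chosen, mass

def uniform_map_error(set_size):
    """Error of a guess within the set when all its members are treated as equally likely."""
    return 1.0 - 1.0 / set_size

# Hypothetical PMF over four candidate identities.
pmf = {"alice": 0.5, "bob": 0.3, "carol": 0.15, "dave": 0.05}
members, mass = smallest_confidence_set(pmf, confidence=0.9)
error = uniform_map_error(len(members))
print(members, mass, error)
```

With this toy PMF the attacker narrows four candidates down to three, and the resulting estimation error depends only on the cardinality of the set.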

In our interpretations, we further use the Rényi entropy, a family of functionals widely used in information theory as a measure of uncertainty. More specifically, Rényi's entropy of order $\alpha$ is defined as

$$H_\alpha(X) = \frac{1}{1-\alpha} \log \sum_{i=1}^{n} p_X(x_i)^\alpha,$$

where $p_X$ is the PMF of an r.v. $X$ that takes on values in the alphabet $\mathcal{X} = \{x_1, \ldots, x_n\}$. In the important case when $\alpha$ is 0, Rényi's entropy is essentially given by the support set of $p_X$, that is,

$$H_0(X) = \log |\{x \in \mathcal{X} : p_X(x) > 0\}|.$$

In this particular case, Rényi's entropy is referred to as Hartley's entropy. Evidently, when $p_X$ is strictly positive, the support set becomes the alphabet and $H_0(X) = \log n$. Under this assumption, the Hartley entropy can be understood as a confidence set with $p = 100\%$. On the other hand, in the limit when $\alpha$ approaches 1, Rényi's entropy reduces to Shannon's,

$$H_1(X) = -\sum_{i=1}^{n} p_X(x_i) \log p_X(x_i).$$

Lastly, in the limit as $\alpha$ goes to $\infty$, the Rényi entropy approaches the min-entropy

$$H_\infty(X) = \min_i \, -\log p_X(x_i) = -\log \max_i p_X(x_i).$$

We shall shortly interpret min-entropy, Shannon's entropy
and Hartley's entropy within our general framework of privacy 
as an attacker estimation error, when Hamming distance is 
used as a distortion measure, first for single occurrences of a 
target information, and later for multiple occurrences. For now, 
we could loosely consider an attacker striving to ascertain the 
outcome of the finite-alphabet r.v. X, and the effect of the 
dispersion of its PMF on such task. Conceptually, we could 
then regard these three types of entropies simply as worst- 
case, average-case and best-case measurements of privacy, 
respectively, on account of the fact that 

$$H_\infty(X) \le H_1(X) \le H_0(X), \tag{6}$$

with equality if, and only if, $X$ is uniformly distributed. More specifically, the min-entropy $H_\infty(X)$ is the minimum of the surprisal or self-information $-\log p_X(x_i)$, whereas the Shannon entropy $H_1(X)$ is a weighted average of such logarithms, and finally, the Hartley entropy $H_0(X)$ optimistically measures the cardinality of the entire set of possible values of $X$ regardless of their likelihood.
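The three entropies and their ordering can be illustrated numerically. The sketch below is an assumption-laden illustration rather than part of the original exposition: the function name `renyi_entropy`, the choice of base-2 logarithms and the example PMF are all hypothetical. The orders 0, 1 and infinity are handled as the special or limiting cases described in the text.

```python
import math

def renyi_entropy(pmf, alpha):
    """Rényi entropy of order alpha, in bits, for a list of probabilities."""
    support = [p for p in pmf if p > 0]
    if alpha == 0:
        return math.log2(len(support))                   # Hartley entropy
    if alpha == 1:
        return -sum(p * math.log2(p) for p in support)   # Shannon entropy (limit)
    if alpha == math.inf:
        return -math.log2(max(support))                  # min-entropy (limit)
    return math.log2(sum(p ** alpha for p in support)) / (1.0 - alpha)

pmf = [0.5, 0.25, 0.125, 0.125]
h_0 = renyi_entropy(pmf, 0)
h_1 = renyi_entropy(pmf, 1)
h_inf = renyi_entropy(pmf, math.inf)
print(h_inf, h_1, h_0)   # worst-case, average-case and best-case measures

uniform = [0.25] * 4
print([renyi_entropy(uniform, a) for a in (0, 1, math.inf)])   # all coincide
```

Evaluating the general formula at an order slightly above 1 gives a value close to the Shannon entropy, which is a quick sanity check of the limiting case.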

After showing that the Hartley, Shannon and min-entropies are particular cases of Rényi's entropy, we now go on to describe a scenario that will allow us to relate our privacy metric to an extensively used criterion. Specifically, we focus on the important case of SDC, where the data publisher plays the system's role. In this scenario, a data publisher wishes to release a microdata set and, before distributing it, the publisher applies some algorithm [10], [11], [12], [13], [14], [15] to enforce the k-anonymity requirement [8], [9]. As mentioned in Sec. II-B, the objective of a linking attack is to unveil the identity of the individuals appearing in a released table by linking the records in this table to any public data set including identifiers. Since k-anonymity is aimed at protecting the data against this attack, in our scenario the attacker's unknown $X$ becomes the user identity. The other variables shown in Table I are as follows: $X'$ are the key attribute values, $Y'$ are the perturbed key attribute values, the attacker's observation $Y$ is assumed to be $Y'$, and finally, $\hat{X}$ is an estimate of the identity of a user. Fig. 5 illustrates this particular case.

Fig. 5: A data publisher plans to release a 3-anonymized microdata set. To this end, the publisher must enforce that, for a given tuple of key attribute values in (b), the probability of ascertaining the identifier value of the corresponding record in (a) must be at most 1/3.

In order to protect the data set from identity disclosure, the algorithm must ensure that, for any observation $y$ consisting of a tuple of perturbed key attribute values in the released table, the identifier value of the corresponding record in the original table cannot be ascertained beyond a subgroup of at least $k$ records. As we shall see next, this requirement will be reflected mathematically by assuming that the probability distribution $p_{X|Y}(x|y)$ of the identifier value, conditioned on the observation $y$, is the uniform distribution on a set of at least $k$ individuals. Lastly, we consider the more general case in which $Y$ consists of $Y'$ and any background knowledge.

That said, our adversarial model contemplates an attacker who uses a MAP estimator, which, as shown in Sec. III, is equivalent to the Bayes estimator. Under this model, given an observation $y$, the conditional privacy becomes

$$\mathcal{P}(y) = P\{X \ne \hat{x}(y) \mid y\} = 1 - \max_x p_{X|Y}(x|y), \tag{7}$$

which is precisely the MAP error $\varepsilon_{\mathrm{MAP}}$, conditioned on that observation $y$; in terms of min-entropy, with logarithms taken to base 2, we may recast our metric as

$$\varepsilon_{\mathrm{MAP}} = 1 - 2^{-H_\infty(X|y)},$$

which shows that the concept of min-entropy is intimately related to MAP decoding. If we finally apply the aforementioned uniformity condition of $p_{X|Y}(x|y)$, and assume that this PMF is the uniform distribution on a group of exactly $k$ individuals, that is, $p_{X|Y}(x_i|y) = 1/k$ for all $i = 1, \ldots, k$, then

$$\mathcal{P}(y) = 1 - 1/k = 1 - 2^{-H_0(X|y)},$$

which expresses the conditional privacy in terms of Hartley's entropy. In a nutshell, the k-anonymity criterion may be interpreted as a special case of our privacy measure, determined by this Rényi entropy.
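The relation between conditional privacy, min-entropy and k-anonymity lends itself to a short numerical check. The sketch below assumes base-2 logarithms; the function names and the example posteriors are hypothetical and serve only to illustrate the identities stated above.

```python
import math

def conditional_privacy_map(posterior):
    """MAP error 1 - max_x p(x|y) of an attacker under Hamming distortion."""
    return 1.0 - max(posterior)

def privacy_from_min_entropy(posterior):
    """Same quantity, written as 1 - 2^(-H_inf) with the min-entropy in bits."""
    h_inf = -math.log2(max(posterior))
    return 1.0 - 2.0 ** (-h_inf)

# Uniform posterior over k identities, as assumed for a k-anonymous release.
k = 3
uniform_posterior = [1.0 / k] * k
print(conditional_privacy_map(uniform_posterior))   # equals 1 - 1/k

# A skewed posterior gives the attacker a better guess, hence less privacy.
skewed_posterior = [0.7, 0.2, 0.1]
print(conditional_privacy_map(skewed_posterior), privacy_from_min_entropy(skewed_posterior))
```

For the uniform posterior the min-entropy and the Hartley entropy coincide, so both expressions reduce to 1 - 1/k.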

After examining this first interpretation, next we shall explore an enhancement of k-anonymity. As argued in Sec. II, this criterion does not protect against confidential attribute disclosure. In an effort to address this limitation, several privacy metrics were proposed. In the remainder of this section, we shall focus on one of these approaches. In particular, we shall consider the l-diversity metric [11], which builds on the k-anonymity principle and aims at overcoming the attribute disclosure problem.

As mentioned in Sec. II-B, a microdata set satisfies l-diversity if, for each group of records sharing a tuple of key attribute values in the perturbed table, there are at least l "well-represented" values for each confidential attribute. Depending on the definition of well-represented, this criterion can reduce to distinct l-diversity, which is equivalent to l-sensitive k-anonymity, or be more restrictive. Concretely, a microdata set is said to meet the entropy l-diversity requirement if, for each group of records with the same tuple of perturbed key attribute values, the entropy of each confidential attribute is at least log l.

In our new scenario, a data publisher, still playing the 
system's role, applies an algorithm on the microdata set to 
enforce the l-diversity principle. Since the aim of this criterion
is to protect the data against attribute disclosure, we consider 
that the attacker's unknown X refers to the confidential at- 
tribute. The other variables remain the same as in our previous 
interpretation. 

Having said that, we shall make the assumption that the l-diversity requirement is met by enforcing that, for a given tuple $y$ of perturbed key attribute values, the probability distribution $p_{X|Y}(x|y)$ of the confidential attribute within the group of records sharing this tuple is the uniform distribution on a set of at least l values. This is depicted in Fig. 6. Note that this assumption entails that the data fulfill both the distinct and entropy l-diversity requirements. Lastly, we shall suppose again that the attacker uses a MAP estimator.

Fig. 6: In this example, the 2-diversity principle is applied to a microdata set. In order to meet this requirement, we assume that, for each group of records with the same tuple of perturbed key attribute values, the probability distribution of the confidential attribute value in (b) is the uniform distribution on a set of at least 2 values.



As mentioned before, under the premise of a MAP attacker, our measure of conditional privacy boils down to the MAP error. If we also apply the assumption above about the uniformity of $p_{X|Y}(x|y)$, and suppose that this distribution is uniform on a set of exactly l values, then the conditional privacy yields

$$\mathcal{P}(y) = 1 - 1/l = 1 - 2^{-H_0(X|y)},$$

which expresses our privacy metric again in terms of Hartley's entropy. In short, the l-diversity criterion lends itself to be interpreted as a particular case of our more general privacy measure.

2) Multiple Occurrences: In this section, we shall consider the case when the variables shown in Table I are sequences of categorical and numerical data, defined on a finite alphabet. In the following, we shall use the notation $X^k$ to denote a sequence $X_1, \ldots, X_k$.

The special case that we contemplate now could perfectly 
model the scenario in which a user interacts with an LBS 
provider, through an intermediate system protecting the user's 
location privacy. In this scenario, a user would submit queries 
along with their locations to the trusted system. An ex- 
ample would be the query "Where is the nearest parking 
garage?", accompanied by the geographic coordinates of the 
user's current location. As many approaches suggest in the 
literature of private LBS, the system would perturb the user 
coordinates and submit them to the LBS provider. In this 
context, the consideration of sequences in our notation makes 
sense. Specifically, an attacker would endeavor to ascertain
the sequence $X^k$ of $k$ unknown locations visited by the user,
from the sequence $Y'^k$ of $k$ perturbed locations that the
system would submit to the LBS. Put differently, the attacker's
unknown would be the location data the user conveys to the
system, i.e., $X^k = X'^k$, and the information available to the
adversary would be the perturbed version of this data, that is, $Y^k = Y'^k$.

Having motivated the case of sequences of data, in this 
section we shall establish a connection between our metric 
and Shannon's entropy as a measure of privacy. But in order 
to emphasize this connection, first we briefly recall one of the pillars of information theory: the asymptotic equipartition property [34], which derives from the weak law of large numbers and has important consequences in this field.

Consider a sequence $X^k$ of $k$ independent, identically distributed (i.i.d.) r.v.'s, drawn according to $p_X(x)$, with alphabet size $n$. Loosely speaking, the AEP states that among all $n^k$ possible sequences, there exists a typical subset $\mathcal{T}$ of sequences almost certain to occur. More precisely, for any $\epsilon > 0$, there exists a $k$ sufficiently large such that $P\{\mathcal{T}\} > 1 - \epsilon$, and $|\mathcal{T}| \le 2^{k(H_1(X)+\epsilon)}$. A similar argument called joint AEP [34] also holds for the i.i.d. sequences $(X^k, Y^k)$ of length $k$ drawn according to $\prod_{i=1}^{k} p_{XY}(x_i, y_i)$. Another information-theoretic result is related to those sequences $x^k$ that are jointly typical with a given typical sequence $y^k$. Namely, the set of all these sequences $x^k$ is referred to as the conditionally typical set $\mathcal{T}_{X|y^k}$ and satisfies, on the one hand, that $P\{\mathcal{T}_{X|y^k}\} > 1 - \epsilon$ for large $k$, and on the other, that its cardinality is bounded in terms of Shannon's conditional entropy, $|\mathcal{T}_{X|y^k}| \le 2^{k(H_1(X|Y)+2\epsilon)}$. Further, it turns out that these conditionally typical sequences are equally likely, with probability $2^{-k H_1(X|Y)}$, approximately in the exponent. While the most likely sequence may in fact not belong to the typical set, the set of typical sequences encompasses a sufficiently large number of sequences that amount to a probability arbitrarily close to certainty.

Next, we proceed to interpret, under the perspective of our framework, the Shannon entropy as a measure of privacy. To this end, consider the scenario in which a privacy attacker observes a typical $Y^k$ and strives to estimate the unknown $X^k$. Conveniently, we assume $X^k = X'^k$ and $Y^k = Y'^k$, which models the LBS example described before, provided that the attacker ignores any spatial-temporal constraint. In other words, we model a scenario without memory and hence suppose that $(X_i, Y_i)$ are i.i.d. drawn according to $p_{XY}(x_i, y_i)$. We would like to stress that the consideration of this simplified model is just for the purpose of providing a simple, clear example that illustrates the application of our framework. Having said this, in the terms above we may regard the conditionally typical set as a confidence set of arbitrarily high confidence with cardinality $2^{k H_1(X|Y)}$, approximately in the exponent.
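The AEP argument can be made concrete for a memoryless binary source. The following sketch is only an illustration under stated assumptions: it considers the unconditional typical set of a Bernoulli source rather than the conditionally typical set, it defines typicality through the sample entropy, and the function name and parameter values are hypothetical. Since all sequences with the same number of ones are equally likely, they are grouped instead of being enumerated one by one.

```python
import math

def typical_set_stats(p_one, k, eps):
    """Entropy (bits), probability and cardinality of the eps-typical set
    of length-k sequences from a Bernoulli(p_one) source."""
    h = -(p_one * math.log2(p_one) + (1 - p_one) * math.log2(1 - p_one))
    prob, size = 0.0, 0
    for ones in range(k + 1):
        # log2-probability shared by every sequence with this many ones
        log_p = ones * math.log2(p_one) + (k - ones) * math.log2(1 - p_one)
        if abs(-log_p / k - h) <= eps:        # typicality test on the sample entropy
            count = math.comb(k, ones)        # number of such sequences
            size += count
            prob += count * 2.0 ** log_p
    return h, prob, size

k, eps = 400, 0.1
h, prob, size = typical_set_stats(p_one=0.2, k=k, eps=eps)
print(prob)                      # close to 1: a high-confidence set
print(math.log2(size) / k, h)    # normalized log-cardinality versus entropy
print(k)                         # log2 of the total number of binary sequences
```

The typical set captures almost all the probability while containing only a small fraction of the 2^k sequences, which is the sense in which the Shannon entropy measures the size of a high-confidence set.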

The upshot is that the Shannon (conditional) entropy of an unknown r.v. (given an observed r.v.) is an approximate measure of the size of a high-confidence set, a measure suitable for attacker models based on the estimation of sequences, rather than individual samples. Moreover, within this confidence set, sequences are equally likely, approximately in the exponent, in accordance with the interpretation of confidence-set cardinality as a measure of privacy made in Sec. V-A1 on single occurrences. Even though for simplicity our argument focused on memoryless sequences, the Shannon-McMillan-Breiman theorem is a generalization of the AEP to stationary ergodic sequences, in terms of entropy rates [37].

Finally, we mentioned that the most likely sequence may 
in fact be atypical, and thus Shannon entropy is not directly 
applicable to MAP estimation over the entire set of sequences. 
Nevertheless, because the most likely memoryless sequence is 
simply a repetition of the most likely symbol, MAP estimation 
on sequences is a trivial extension of the argument on min-entropy presented in Sec. V-A1.



B. Non-Hamming Distortion 

This section investigates the complementary case described in Sec. V, in which the attacker's distortion function is not the Hamming distance. In particular, we turn our attention to the scenario of SDC and consider the more general and realistic case in which the attacker's distortion function is unknown to the data publisher. The only piece of information known to the publisher is $d_{\max} = \max_{x,\hat{x}} d_A(x, \hat{x})$. If, on the contrary, the attacker's distortion function were entirely known, the system would use BDT to find the decision rule $p_{Y'|X'}$ which maximizes either the worst-case privacy or the average privacy (4), and satisfies a constraint on average distortion.

Bearing in mind the above consideration, in our new sce- 
nario a privacy attacker endeavors to guess the confidential 
attribute value of a particular respondent in the released table. 
Initially, the attacker has a prior belief given by $p_X$, that is,
the distribution of that confidential attribute value in the whole
table. Later, the attacker observes that the user belongs to a
group of records sharing a tuple of perturbed key attribute
values $y$, which is supposed to coincide with the system's decision $y'$.
Based on this observation, the attacker updates their
prior belief and obtains the posterior distribution $p_{X|Y}(\cdot|y)$.
This situation is illustrated in Fig. 7. A fundamental question
that arises in this context is how much privacy the released 
table leaks as a result of that observation. In the remainder of 
this section, we elaborate on this question and provide an upper 
bound on the reduction in privacy incurred by the disclosure 
of that information. 

1) Total Variation and t-Closeness: For notational simplicity, we occasionally rename the posterior and the prior distributions $p_{X|Y}(\cdot|y)$ and $p_X$ simply with the symbols $p$ and $q$, respectively, but bear in mind that $p$ is a PMF of $X$ parametrized by $y$. In addition, we shall assume that the attacker adopts a MAP strategy. More precisely, $\hat{x}_p$ and $\hat{x}_q$ will denote the attacker's estimate when using the distributions $p$ and $q$. Under these assumptions, the reduction in conditional privacy can be expressed as

$$\begin{aligned}
\Delta\mathcal{P}(y) &= \mathrm{E}_p\, d_A(X, \hat{x}_q) - \mathrm{E}_p\, d_A(X, \hat{x}_p) \\
&= \mathrm{E}_p\, d_A(X, \hat{x}_q) - \mathrm{E}_q\, d_A(X, \hat{x}_q) + \mathrm{E}_q\, d_A(X, \hat{x}_q) \\
&\quad - \mathrm{E}_q\, d_A(X, \hat{x}_p) + \mathrm{E}_q\, d_A(X, \hat{x}_p) - \mathrm{E}_p\, d_A(X, \hat{x}_p),
\end{aligned}$$

where $\mathrm{E}_p$ and $\mathrm{E}_q$ denote that the expectation is taken over the posterior and the prior distributions, respectively, as PMFs of $X$.

In this expression, the first two terms can be upper bounded by $d_{\max} \sum_x |p_x - q_x|$, since $0 \le d_A \le d_{\max}$ and $p_x - q_x \le |p_x - q_x|$. Clearly, this same bound applies to the last two terms. On the other hand, the remaining terms $\mathrm{E}_q\, d_A(X, \hat{x}_q) - \mathrm{E}_q\, d_A(X, \hat{x}_p)$ are upper bounded by 0, since the error incurred by $\hat{x}_q$ under the prior is smaller than or equal to that of $\hat{x}_p$. In the end, we obtain that

$$\Delta\mathcal{P}(y) \le 2\, d_{\max} \sum_x |p_x - q_x|.$$

At this point, we shall briefly review the concept of total variation. For this purpose, consider $P$ and $Q$ to be two PMFs over $\mathcal{X}$. In probability theory, the total variation distance between $P$ and $Q$ is

$$\mathrm{TV}(P \,\|\, Q) = \frac{1}{2} \sum_{x \in \mathcal{X}} |P(x) - Q(x)|.$$

Furthermore, recall that, in information theory, Pinsker's inequality relates the total variation distance with the KL divergence. Particularly, $\mathrm{TV}(P \,\|\, Q) \le \sqrt{\tfrac{1}{2} D(P \,\|\, Q)}$. Having stated this result, now the total variation distance permits writing the upper bound on $\Delta\mathcal{P}(y)$ in terms of the KL divergence:

$$\Delta\mathcal{P}(y) \le 4\, d_{\max} \mathrm{TV}(p \,\|\, q) \le 2\sqrt{2}\, d_{\max} \sqrt{D(p \,\|\, q)},$$

where the last inequality follows from Pinsker's inequality. Returning to the notation of prior and posterior distributions,

$$\Delta\mathcal{P}(y) \le 4\, d_{\max} \mathrm{TV}(p_{X|Y}(\cdot|y) \,\|\, p_X) \le 2\sqrt{2}\, d_{\max} \sqrt{D(p_{X|Y}(\cdot|y) \,\|\, p_X)}. \tag{8}$$

This upper bound allows us to establish a connection between our privacy criterion and t-closeness [12]. The latter criterion boils down to defining a maximum discrepancy between the posterior and prior distributions,

$$t = \max_y D(p_{X|Y}(\cdot|y) \,\|\, p_X).$$

Under this definition and on account of (8), $\Delta\mathcal{P}(y) \le 2\sqrt{2}\, d_{\max} \sqrt{t}$. Therefore, t-closeness is essentially equivalent to bounding the decrease in conditional privacy.

On a different note, we would like to make a comment on an issue of a purely technical nature. Clearly, in light of inequality (8), the minimization of either the total variation distance or the KL divergence leads to the minimization of an upper bound on $\Delta\mathcal{P}(y)$. However, the fact that the KL divergence imposes a looser upper bound suggests considering it only when the resulting mathematical model is more tractable than the one built upon the total variation distance.

Fig. 7: At first, an attacker believes that the probability that a user appearing in (b) suffers from AIDS is 1/2. However, after observing that the user's record is one of the last four records, this probability becomes 3/4.
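The chain of bounds on the reduction in conditional privacy, in terms of the total variation distance and the KL divergence, can be checked numerically. The sketch below rests on several assumptions that are not in the original text: the function name is hypothetical, the attacker's estimates are taken to be Bayes estimates (minimizers of the expected distortion under the prior and the posterior, respectively), the KL divergence is measured in nats so that Pinsker's inequality applies in the form quoted above, and the distributions and distortion matrices are made up for illustration.

```python
import math
import random

def privacy_reduction_and_bounds(p, q, d):
    """Return (reduction, TV bound, KL bound) for posterior p, prior q and
    distortion matrix d[x][x_hat]."""
    n = len(p)

    def expected(dist, x_hat):
        return sum(dist[x] * d[x][x_hat] for x in range(n))

    x_hat_q = min(range(n), key=lambda e: expected(q, e))   # Bayes estimate under the prior
    x_hat_p = min(range(n), key=lambda e: expected(p, e))   # Bayes estimate under the posterior
    reduction = expected(p, x_hat_q) - expected(p, x_hat_p)
    d_max = max(max(row) for row in d)
    tv = 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)   # nats
    return reduction, 4 * d_max * tv, 2 * math.sqrt(2) * d_max * math.sqrt(kl)

# Hand-picked case with Hamming distortion.
hamming = [[0.0 if x == e else 1.0 for e in range(3)] for x in range(3)]
reduction, tv_bound, kl_bound = privacy_reduction_and_bounds(
    p=[0.1, 0.3, 0.6], q=[0.6, 0.3, 0.1], d=hamming)
print(reduction, tv_bound, kl_bound)

# Random strictly positive PMFs and random non-negative distortion matrices.
rng = random.Random(0)
all_hold = True
for _ in range(500):
    n = 4
    raw_p = [rng.random() + 1e-3 for _ in range(n)]
    raw_q = [rng.random() + 1e-3 for _ in range(n)]
    p = [v / sum(raw_p) for v in raw_p]
    q = [v / sum(raw_q) for v in raw_q]
    d = [[0.0 if x == e else rng.random() for e in range(n)] for x in range(n)]
    red, tv_b, kl_b = privacy_reduction_and_bounds(p, q, d)
    all_hold = all_hold and (-1e-12 <= red <= tv_b + 1e-12 <= kl_b + 2e-12)
print(all_hold)
```

In these experiments the KL-based bound is never tighter than the total-variation bound, which is consistent with the remark on tractability made above.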



2) Mutual Information and Rate-Distortion Theory: The privacy criterion proposed in [15], called (average) privacy risk $\mathcal{R}$, is the average-case version of t-closeness. Formally, $\mathcal{R}$ is a conditional KL divergence, the average discrepancy between the posterior and the prior distributions, which turns out to coincide with the mutual information between the confidential data $X$ and the observation $Y$:

$$\mathcal{R} = \mathrm{E}_Y D(p_{X|Y}(\cdot|Y) \,\|\, p_X) = \mathrm{E}_Y \mathrm{E}_{X|Y} \log \frac{p_{X|Y}(X|Y)}{p_X(X)} = \mathrm{E} \log \frac{p_{X|Y}(X|Y)}{p_X(X)} = I(X;Y).$$

Directly from their definition, $\mathcal{R} \le t$, meaning that t-closeness is a stricter measure of privacy risk. Because the KL divergence is itself an average, $\mathcal{R}$ is clearly an average-case privacy criterion, but t-closeness is technically a maximum of an expectation, a hybrid between average case and worst case. The next subsection will comment on a third, purely worst-case criterion. When choosing a privacy criterion, it is important to keep in mind that optimizing a privacy mechanism for the best worst-case scenario will in general yield a worse average case, and vice versa.

Further, we conveniently rewrite inequality (8) as

$$\frac{1}{8\, d_{\max}^2} \Delta\mathcal{P}(y)^2 \le D(p_{X|Y}(\cdot|y) \,\|\, p_X).$$

By averaging over all possible observations $y$, the right-hand side of this inequality becomes the privacy risk $\mathcal{R}$, which we showed to be equal to the mutual information. This leads to a bound on the privacy reduction in terms of mutual information,

$$\frac{1}{8\, d_{\max}^2} \mathrm{E}\left[\Delta\mathcal{P}(Y)^2\right] \le I(X;Y).$$

Based on this observation, it is clear that the minimization of the mutual information contributes to the minimization of an upper bound on $\Delta\mathcal{P}(y)$. With this in mind, we now consider the more general scenario in which $Y'$ and $Y$ need not necessarily coincide, and contemplate the case of a data publisher. Concretely, from the perspective of a publisher, we would choose a randomized perturbation rule $p_{Y'|X'}$ with the aim of minimizing the mutual information between $X$ and $Y$, and consequently protecting user privacy. Evidently, the publisher would also need to guarantee the utility of the data to a certain extent, and thus impose a constraint on the average distortion. In conclusion, the data publisher would strive to solve the optimization problem

$$\min_{\substack{p_{Y'|X'} \\ \mathrm{E}\, d_U(X', Y') \le \mathcal{D}}} I(X;Y), \tag{9}$$

which, surprisingly, bears a strong resemblance to the rate-distortion problem in the field of information theory.

More specifically, the above optimization problem is a generalization of a well-known, extensively studied information-theoretic problem with more than half a century of maturity, namely, the problem of lossy compression of source data with a distortion criterion, first proposed by Shannon in 1959 [38].

The importance of this lies in the fact that some of the information-theoretic results and methods for the rate-distortion problem can be extended to the problem (9). For example, in the special case when $X = X'$ and $Y = Y'$, our more general problem boils down to Shannon's rate-distortion problem and, interestingly, can be computed with the Blahut-Arimoto algorithm [34].
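For the special case in which the confidential data coincide with the system's input and the attacker observes exactly the system's output, a point on the privacy-distortion trade-off can be computed with the Blahut-Arimoto iteration. The following is a minimal sketch of the standard algorithm for the rate-distortion function, not the authors' implementation: the function names, the Lagrange parameter `beta` that selects the point on the curve, the fixed number of iterations and the binary test case are all illustrative assumptions.

```python
import math

def blahut_arimoto(p_x, dist, beta, iterations=200):
    """Return (mutual information in bits, average distortion) of the
    perturbation rule found at Lagrange slope beta."""
    n, m = len(p_x), len(dist[0])
    q_y = [1.0 / m] * m                       # output marginal, initialised uniform
    for _ in range(iterations):
        # Perturbation rule that is optimal for the current output marginal.
        rule = []
        for x in range(n):
            row = [q_y[y] * math.exp(-beta * dist[x][y]) for y in range(m)]
            total = sum(row)
            rule.append([w / total for w in row])
        # Output marginal induced by that rule.
        q_y = [sum(p_x[x] * rule[x][y] for x in range(n)) for y in range(m)]
    rate = sum(p_x[x] * rule[x][y] * math.log2(rule[x][y] / q_y[y])
               for x in range(n) for y in range(m) if rule[x][y] > 0)
    distortion = sum(p_x[x] * rule[x][y] * dist[x][y]
                     for x in range(n) for y in range(m))
    return rate, distortion

def binary_entropy(d):
    return -(d * math.log2(d) + (1 - d) * math.log2(1 - d))

# Uniform binary source with Hamming distortion.
hamming = [[0.0, 1.0], [1.0, 0.0]]
rate, distortion = blahut_arimoto([0.5, 0.5], hamming, beta=2.0)
print(rate, distortion, 1.0 - binary_entropy(distortion))
```

For a uniform binary source with Hamming distortion, the rate-distortion function is known in closed form as one minus the binary entropy of the distortion, so the last two printed values should agree.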

3) δ-Disclosure and Differential Privacy: Finally, we quickly remark on the connection of δ-disclosure and ε-differential privacy with our theoretical framework. δ-disclosure [13] is an even stricter privacy criterion than t-closeness, and hence much stricter than the average privacy risk $\mathcal{R}$ or mutual information, discussed in the previous subsection. The definition of δ-disclosure may be rewritten in terms of our notation as

$$\delta = \max_{x,y} \left| \log \frac{p_{X|Y}(x|y)}{p_X(x)} \right|$$

and understood as a worst-case privacy criterion. In fact, $\mathcal{R} \le t \le \delta$.
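The three criteria discussed in this section, namely the average privacy risk (mutual information), t-closeness and δ-disclosure, can be compared on a small joint distribution. The sketch below is an illustration under stated assumptions: the function name and the joint PMF are hypothetical, logarithms are taken to base 2, and the joint PMF is assumed strictly positive so that every log ratio is finite.

```python
import math

def privacy_criteria(joint):
    """joint[x][y] = p_XY(x, y), assumed strictly positive.
    Returns (privacy risk R, t, delta), all in bits."""
    n, m = len(joint), len(joint[0])
    p_x = [sum(joint[x]) for x in range(n)]
    p_y = [sum(joint[x][y] for x in range(n)) for y in range(m)]
    risk, t, delta = 0.0, 0.0, 0.0
    for y in range(m):
        kl = 0.0
        for x in range(n):
            posterior = joint[x][y] / p_y[y]
            log_ratio = math.log2(posterior / p_x[x])
            kl += posterior * log_ratio            # KL divergence for this observation
            delta = max(delta, abs(log_ratio))     # worst-case absolute log ratio
        risk += p_y[y] * kl                        # average over observations
        t = max(t, kl)                             # maximum over observations
    return risk, t, delta

joint = [[0.3, 0.1],
         [0.1, 0.5]]
risk, t, delta = privacy_criteria(joint)
print(risk, t, delta)
```

The average-case value is the smallest and the purely worst-case value the largest, in line with the relative strictness of the three criteria.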

We mentioned in the background section that [14] analyzes the case of the randomized perturbation $Y$ of a true answer $X$ to a query in a private information retrieval system, before returning it to the user. Consider two databases $d$ and $d'$ that differ only by one record, but are subject to a common perturbation rule $p_{Y|X}$, and let $p_Y$ and $p'_Y$ be the two probability distributions of perturbed answers induced. After a slight manipulation of the definition given in the work cited, but faithfully to its spirit, we may say that a randomized perturbation rule provides ε-differential privacy when

$$\epsilon = \max_y \left| \log \frac{p_Y(y)}{p'_Y(y)} \right|.$$

Even though it is clear that this formulation does not quite match the problem in terms of prior and posterior distributions described thus far, this manipulation enables us to still establish a loose relation with δ-disclosure, in the sense that the latter privacy criterion is a slightly stricter measure of discrepancy between PMFs, also based on a maximum (absolute) log ratio. We note, however, that although there is a formal similarity between the metrics, there are substantial differences between them in terms of their assumptions, objectives, models, and privacy guarantees.

VI. Numerical Example 

This section provides two simple albeit insightful examples 
that illustrate the measurement of privacy as an attacker's 
estimation error. Specifically, we quantify the level of pri- 
vacy provided, first, by a privacy-enhancing mechanism that 
perturbs location information in the scenario of LBS, and 
secondly, by an anonymous-communication protocol largely
based on Crowds [20].

A. Data Perturbation in Location-Based Services 

Our first example contemplates a user who wishes to access 
an LBS provider. For instance, this could be the case of a user 
who wants to find the closest Italian restaurant to their current 
location. For this purpose, the user would inevitably have 
to submit their GPS coordinates to the (untrusted) provider. 
To avoid revealing their exact location, however, the user could perturb their location information by adding, for example, Gaussian noise. Alternatively, we could consider a user delegating this task to a (trusted) intermediary entity, as described in Sec. V-A2. In any case, data perturbation would enhance user privacy in terms of location, although clearly at the cost of data utility. Simply put, perturbative privacy methods present an inherent trade-off between data utility and privacy.

Under the former strategy, and in accordance with the notation defined in Sec. IV-A, the user becomes the system: it is the user who is responsible for protecting their location data. Playing the role of the system, the user then decides to perturb their location data $X$ on an individual basis for each query. In other words, we do not contemplate the case of sequences of data $X^k$, as Sec. V-A2 does.

Fig. 8: A user looking for a nearby Italian restaurant accesses an LBS provider.
The user decides to perturb their actual location before querying the provider.
In doing so, the user hinders the provider itself, and any attacker capable of
capturing the query, in their efforts to compromise user privacy in terms of
location. In this example, we contemplate that the user is solely responsible
for protecting their private data. In terms of our notation, this allows us to
regard the user as the system. Notice that the user's actual location is, on
the one hand, the attacker's unknown, and on the other, the information
that the user (system) takes as input to generate the location that will be
finally revealed. Thus we conclude that $X = X'$. Then, according to some
randomized perturbation rule $p_{Y'|X'}$, the user discloses, for each location
$x'$, a perturbed version $y'$. This perturbed location is submitted to the
provider, which only has access to this information, i.e., $Y = Y'$. Lastly,
based on this revealed information, the attacker uses a Bayes estimator $\hat{x}(y)$
to ascertain the user's actual location $X$.

A key element of our framework is the attacker's distortion
function. In our example we assume the squared error between
the actual location $x$ and the attacker's estimate $\hat{x}$, that is,
$d_A(x, \hat{x}) = \|x - \hat{x}\|^2$. Note that, unlike Hamming distance,
the squared error does quantify how much the estimate differs
from the unknown. As for the other variables of our model, we
contemplate that the attacker's input $Y$ is directly the location
data perturbed by the user, $Y'$, as illustrated in Fig. 8. Put
differently, the attacker, assumed to be the service provider,
has no more information than that disclosed by the user. Under
all these assumptions, the average privacy is

$$\mathcal{P}_{\text{avg}} = \mathrm{E}\big[\|X - \hat{X}\|^2\big],$$

that is, the mean squared error (MSE).
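As a rough numerical illustration of this MSE criterion, the following sketch adds two assumptions that the text does not make: each coordinate of the true location is zero-mean Gaussian with variance `var_x`, and the user adds independent zero-mean Gaussian noise with variance `var_n`. Under those assumptions the Bayes estimator for squared error is the conditional mean, which shrinks the observation linearly, and the MSE per coordinate is `var_x * var_n / (var_x + var_n)`. All function and parameter names are hypothetical.

```python
import random

def bayes_estimate(y, var_x, var_n):
    """Conditional mean E[X | Y = y], assuming X ~ N(0, var_x) and Y = X + N(0, var_n)."""
    gain = var_x / (var_x + var_n)
    return tuple(gain * yi for yi in y)

def theoretical_privacy(var_x, var_n, dims=2):
    """Average privacy (MSE of the Bayes estimator), summed over coordinates."""
    return dims * var_x * var_n / (var_x + var_n)

def simulated_privacy(var_x, var_n, trials=100_000, dims=2, seed=0):
    """Monte Carlo estimate of E[||X - x_hat(Y)||^2] under the assumed Gaussian model."""
    rng = random.Random(seed)
    sd_x, sd_n = var_x ** 0.5, var_n ** 0.5
    total = 0.0
    for _ in range(trials):
        x = [rng.gauss(0.0, sd_x) for _ in range(dims)]       # true location
        y = [xi + rng.gauss(0.0, sd_n) for xi in x]           # perturbed location
        x_hat = bayes_estimate(y, var_x, var_n)               # attacker's estimate
        total += sum((xi - hi) ** 2 for xi, hi in zip(x, x_hat))
    return total / trials
```

In this toy model, increasing `var_n` raises the attacker's MSE towards the prior variance `dims * var_x`, which mirrors the trade-off between privacy and data utility described above.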

As a final remark, we would like to connect our privacy
criterion with a metric specifically conceived for the LBS
scenario at hand [33]. In this cited work, the authors propose
a framework that contemplates different aspects of the
adversarial model, captured by means of what they call certainty,
accuracy and correctness. The information to be protected
by a trusted intermediary system consists of traces modeling the
locations visited by users over a period of time. The system
accomplishes this task by hiding certain locations, reducing
the accuracy of such locations or adding noise. As a result,
the attacker observes a perturbed version of the traces and,
together with certain mobility profiles of these users, attempts
to deduce some information of interest X about the actual
traces. In terms of our notation, the observed trajectories and
the mobility patterns constitute the attacker's observation Y.

More accurately, given a particular observation $y$, the
attacker strives to calculate the posterior distribution $p_{X|Y}$.
However, since the adversary may have limited
resources, they may have to content themselves with an
estimate $\hat{p}_{X|Y}$. The authors then use Shannon's entropy to
measure the uncertainty of $X$, and define accuracy as the
discrepancy between $p_{X|Y}$ and $\hat{p}_{X|Y}$. Finally, they refer to
location privacy as correctness and measure it as

$$\mathrm{E}_{\hat{p}_{X|Y}}\big[d_S(X, x_t) \mid y\big],$$

where $x_t$ is the true outcome of $X$, $d_S$ is a distance function
specified by the system, and the expectation is taken over the
estimate of the posterior distribution.
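To make the correctness expression concrete, here is a minimal sketch that evaluates it for a discrete estimated posterior. The dictionary representation, the Euclidean distance and the function names are illustrative assumptions; they are not taken from [33].

```python
def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

def correctness(posterior_estimate, true_value, distance=euclidean):
    """Expected distance between X and its true outcome x_t.

    posterior_estimate maps each candidate x to the attacker's estimated
    probability of x given the observation y (values should sum to 1).
    """
    return sum(prob * distance(x, true_value)
               for x, prob in posterior_estimate.items())

# Hypothetical example: three candidate locations, the first one being the true one.
estimate = {(0, 0): 0.5, (3, 4): 0.3, (6, 8): 0.2}
value = correctness(estimate, (0, 0))   # 0.3 * 5 + 0.2 * 10
```

A larger value means that the attacker's estimated posterior places more mass far from the true location, that is, more privacy in the sense of correctness.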

The most notable difference between [33] and our own
work is that the authors limit the scope of their metric to the
specific scenario of location-based services, whereas here we
attempt to provide a general overview. Besides, their proposal
is a measure of privacy in an average-case sense. Another
difference is that they argue against the use of entropy and
k-anonymity for the purposes of their field of application. In
this regard, we not only justify but also relate entropy and
k-anonymity within a generalized perspective on the attacker's
estimation error, rather than excluding them. Lastly, their
implementation of estimation strategies using the
forward-backward [39] and the Metropolis-Hastings [40] algorithms
is undoubtedly of great interest, but the focus of the present
work is on metrics.

B. Crowds-like Protocol for Anonymous Communications 



In Sec. II-A we mentioned Chaum's mixes as a building
block to implement anonymous communications networks.
A different approach to communication anonymity is based
on collaborative, peer-to-peer architectures. An example of
a collaborative approach is Crowds [20], in which users form
a "crowd" to provide anonymity for each other.

In Crowds, a user who wants to browse a Web site forwards
the request to another member of their crowd chosen uniformly
at random. This crowd member decides with probability $p$ to
send the request to the Web site, and with probability $1 - p$
to send it to another randomly chosen crowd member, who
in turn repeats the process. For the purposes of illustration,
we consider a variation of the Crowds protocol. The main
difference with respect to the original Crowds is that we do not
introduce a mandatory initial forwarding step. We note that this
variation provides worse anonymity than the original protocol,
while also reducing the cost (in terms of delay and bandwidth)
with respect to Crowds. Further, we assume that the users
participating in the protocol are honest; i.e., we only consider
the Web site receiving the request as a possible adversary.

More formally, consider $n$ users indexed by $i = 1, \ldots, n$,
wishing to communicate with an untrusted server. In order to
attain a certain degree of anonymity, each user submits the
message directly to said server with probability $p \in (0, 1)$,
and forwards it to one of the $n$ users, chosen uniformly at
random and possibly themselves, with probability $1 - p$. In the
case of forwarding, the recipient
performs exactly the same probabilistic decision until the
message arrives at the server. Fig. 9 shows the operation of
this protocol.
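The forwarding rule just described can be simulated in a few lines. The sketch below is only an illustration: users are indexed from 0 to n - 1 rather than from 1 to n, and the function names are hypothetical.

```python
import random

def last_sender(originator, n, p, rng):
    """Simulate one message and return the user who finally submits it to the server."""
    holder = originator
    while rng.random() >= p:        # with probability 1 - p the holder forwards
        holder = rng.randrange(n)   # recipient is uniform over all n users, possibly the holder
    return holder                   # with probability p the holder submits to the server

def empirical_hit_rate(n, p, trials=100_000, seed=0):
    """Fraction of simulated messages whose last sender is also the originator."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = rng.randrange(n)        # originator drawn uniformly
        hits += last_sender(x, n, p, rng) == x
    return hits / trials
```

For moderate `n` and `p`, the value returned by `empirical_hit_rate(n, p)` should be close to `p + (1 - p) / n`, since the originator is the last sender either by submitting directly or by being drawn uniformly after some forwarding.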

In our protocol, we assume that the server attempts to guess
the identity of the author of a given message, represented
by the r.v. X, knowing only the user who last forwarded
it, represented by the r.v. Y, consistently with the notation
defined in Sec. IV-A. The other variables of our framework
are as follows. Since the set of users involved in the protocol
collaborate to frustrate the efforts of the server, they are in
fact the system. The information that then serves as input to
this system is simply the identity of the user who initiates
the forwarding protocol, X. That is, the attacker's uncertainty
and the system's input coincide, X' = X. Then again, the
assumption that the server just knows the last sender in the
forwarding chain leads to Y = Y'.

Fig. 9: Anonymous-communication protocol inspired by Crowds. In our
second numerical example, we contemplate a scenario where users send
messages to a common, untrusted server, which aims to compromise sender
anonymity. In response to this privacy threat, users decide to adhere to a
modification of the Crowds protocol, whose operation is as follows: each
user flips a biased coin and, depending on the outcome, chooses to submit the
message to the server or else to another user, who is asked to perform the
same process. The probability that a user forwards the message to the server is
denoted by $p$, whereas the probability of sending it to any given peer, including
themselves, is $(1 - p)/n$.

Under this model, and under the assumption of a uniform
message-generation rate, that is, $p_X(x) = 1/n$ for all $x$, it can
be proven that the conditional PMF of $X$ given $Y = y$ is

$$p_{X|Y}(x|y) = \begin{cases} p + (1 - p)/n, & x = y, \\ (1 - p)/n, & x \neq y. \end{cases} \qquad (10)$$



Fig. 10 shows this conditional probability in the particular
case when $x = 1$, i.e., the probability that the originator of a
message is user 1, conditioned on the observation that the last
sender is user $y$. Note that, because of the symmetry of our
model, it would be straightforward to derive a PMF analogous
to the one plotted in this figure, but for other originators of
the message, namely $x = 2, \ldots, n$.
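The conditional PMF in (10) can be checked with exact arithmetic. The sketch below applies Bayes' rule under the uniform prior and a likelihood reasoned as follows: the originator is the last sender either because they submit directly (probability p), or because, after at least one forwarding step, the holder is uniform over the n users. This derivation and the function names are our own illustration, with users indexed from 0.

```python
from fractions import Fraction

def likelihood(y, x, n, p):
    """P(Y = y | X = x): direct submission by x, or a uniform holder after forwarding."""
    return (p if x == y else 0) + (1 - p) / n

def posterior(y, n, p):
    """List of P(X = x | Y = y) for x = 0, ..., n - 1, under a uniform prior."""
    prior = Fraction(1, n)
    joint = [prior * likelihood(y, x, n, p) for x in range(n)]
    total = sum(joint)                      # P(Y = y)
    return [j / total for j in joint]

# Hypothetical parameters: n = 4 users, p = 1/2, last sender y = 0.
pmf = posterior(0, 4, Fraction(1, 2))
```

With these parameters the entry for x = y equals p + (1 - p)/n = 5/8 and every other entry equals (1 - p)/n = 1/8, in agreement with (10).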

Fig. 10: Probability $\mathrm{P}\{X = 1 \mid Y = y\}$ that the original sender of a given
message is user 1, conditioned on the observation that the last sender in the
forwarding path is user $y$. The PMF takes the value $p + (1 - p)/n$ at $y = 1$ and
$(1 - p)/n$ at each of the remaining $n - 1$ values of $y$; it thus attains its
maximum value when this last sender is precisely user 1.

That said, assume that the attacker chooses Hamming
distance as distortion function. Under this assumption, the
conditional privacy yields

$$\mathcal{P}(y) = \mathrm{P}\{X \neq \hat{x}(y) \mid y\},$$

that is, the MAP error conditioned on the observation $y$.
Because Hamming distance implies, by virtue of (1), that
Bayes estimation is equivalent to MAP estimation, it follows
that the attacker's (best) decision rule is $\hat{x}(y) = y$. Building
on this observation, we obtain that the privacy level provided
by this variant of Crowds is

$$\mathcal{P}(y) = \mathrm{P}\{X \neq y \mid y\} = 1 - \mathrm{P}\{X = y \mid y\} = (1 - p)(1 - 1/n),$$

from which an entirely expected result follows: the lower
the probability $p$ of forwarding a message directly to the
server, the higher the privacy provided by the protocol, but
the higher the delay in the delivery of said message.
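The trade-off between privacy and delay can be tabulated with a short sketch. The privacy expression is the one derived above; measuring delay as the expected number of transmissions, which is 1/p because the number of holders is geometrically distributed, is an assumption of this illustration rather than a definition from the text.

```python
def map_error_privacy(n, p):
    """Probability that the MAP guess x_hat(y) = y is wrong: (1 - p)(1 - 1/n)."""
    return (1 - p) * (1 - 1 / n)

def expected_transmissions(p):
    """Mean number of transmissions until the message reaches the server (assumed delay proxy)."""
    return 1 / p

def tradeoff_table(n, probabilities):
    """Return (p, privacy, expected transmissions) triples for the given values of p."""
    return [(p, map_error_privacy(n, p), expected_transmissions(p))
            for p in probabilities]

if __name__ == "__main__":
    for p, privacy, hops in tradeoff_table(n=10, probabilities=[0.1, 0.25, 0.5, 0.9]):
        print(f"p={p:.2f}  privacy={privacy:.3f}  expected transmissions={hops:.1f}")
```

Both columns decrease as `p` grows, so lowering `p` buys privacy at the price of a longer forwarding path.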

In the following, we consider the measurement of the
privacy protection offered by this protocol in terms of the
three Rényi entropies introduced in Sec. V-A, namely the
min-entropy $H_\infty(X|y)$, the Shannon entropy $H_1(X|y)$ and
the Hartley entropy $H_0(X|y)$ of the r.v. $X$, modeling the
actual sender of a given message (the privacy attacker's
target), given the observation of the user who last forwarded
it, $y$. Specifically, we connect the interpretations described in
Sec. V-A to the example at hand.

But first we would like to recall from Sec. V-A that
$H_\infty(X|y)$, $H_1(X|y)$ and $H_0(X|y)$ may be considered, from
the point of view of the user, as worst-case, average-case and
best-case measurements of privacy, respectively, in the sense
that

$$H_\infty(X|y) \le H_1(X|y) \le H_0(X|y),$$

owing to (6), with equality if and only if the conditional PMF
of $X$ given $Y = y$ is uniform. Revisiting the interpretations
given in that section, recall that the min-entropy $H_\infty(X|y)$ is
directly connected with the maximum probability, in our case
$\max_x p_{X|Y}(x|y) = p + (1 - p)/n$, on account of (10). More
concretely, and in the context of our example, min-entropy
reflects the model in which a privacy attacker makes a single
guess of the originator of a message, specifically the most
likely one, which corresponds to $x = y$.

At the other extreme, the Hartley entropy $H_0(X|y)$ is a
possibilistic rather than probabilistic measure, as it corresponds
to the assumption that a privacy attacker would not content
themselves with discarding all but the most likely sender, but
would consider instead all possible users. More accurately, measuring
privacy as Hartley's entropy essentially boils down to the
cardinality of the set of all possible originators of a message,
namely $H_0(X|y) = \log n$.

On a middle ground lies Shannon's entropy, which was
interpreted in Sec. V-A2 by means of the AEP, specifically in
terms of the effective cardinality of the set of typical sequences
of i.i.d. samples of an r.v. Put in the context of our Crowds-like
protocol, however, Shannon's entropy may be deemed an
average-case metric that considers the entire PMF of $X$ given
$Y = y$, and not merely its maximum value or its support set.
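The three entropies can be computed directly for the conditional PMF of the Crowds-like protocol. The sketch below uses base-2 logarithms, which is an assumption since the text writes log n without a base, and hypothetical function names.

```python
import math

def crowds_posterior(n, p):
    """PMF of X given Y = y: one mass p + (1 - p)/n and n - 1 masses (1 - p)/n."""
    low = (1 - p) / n
    return [p + low] + [low] * (n - 1)

def min_entropy(pmf):
    """H_inf: depends only on the most likely outcome (single best guess)."""
    return -math.log2(max(pmf))

def shannon_entropy(pmf):
    """H_1: averages over the entire PMF."""
    return -sum(q * math.log2(q) for q in pmf if q > 0)

def hartley_entropy(pmf):
    """H_0: logarithm of the number of outcomes with nonzero probability."""
    return math.log2(sum(1 for q in pmf if q > 0))

def entropy_profile(n, p):
    """Return (H_inf, H_1, H_0) in bits for the Crowds-like posterior."""
    pmf = crowds_posterior(n, p)
    return min_entropy(pmf), shannon_entropy(pmf), hartley_entropy(pmf)
```

For any `p` in (0, 1) the three values are ordered as in the inequality above, and they approach each other as `p` tends to 0, where the posterior becomes uniform.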

VII. Guide for System Designers

The purpose of this section is to serve as a guide for those
system designers who wish to quantify the level of protection
offered by their privacy-enhancing technologies, without
having to delve into the mathematical details set forth in Sec. V. In
order to assist such designers in the selection of the privacy
metric most appropriate for their requirements, this section
revisits the application scenarios of SDC and anonymous
communications, and classifies some of the metrics used in
these fields in terms of worst case, average case and best case,
from the perspective of the user.

In the scenario of SDC, a data publisher aims at protecting 
the privacy of the individuals appearing in a microdata set. 



Depending on the privacy requirements, the publisher may 
want to prevent an attacker from ascertaining the confiden- 
tial attribute value of any respondent in the released table. 
Under this requirement, t-closeness and mutual information 
appear as acceptable measures of privacy, since both criteria 
protect against confidential attribute disclosure. Recall that 
the assumptions on which they are based are a prior belief 
about the value of the confidential attribute in the table, and 
a posterior belief of said value given by the observation that 
the user belongs to a particular group of this table. Building 
on these premises, t-closeness may be regarded as a worst- 
case measurement of privacy, in the sense that it identifies the 
group of users whose distribution of the confidential attribute 
deviates the most from the distribution of this same attribute 
in the entire table. In this regard, we would like to note that a 
worst-case metric from the point of view of the user is a best- 
case measure from the standpoint of the attacker, and vice 
versa. 

Although t-closeness overcomes the similarity and skewness
attacks mentioned in Sec. II-B, its main limitation is that
no computational procedure to reach this criterion has been
specified. An alternative is the mutual information between
the confidential attributes and the observation, an
average-case version of t-closeness that leads to a looser measure of
privacy. Both metrics assume the more
general case in which the attacker's distortion function is not
the Hamming distance. Specifically, this assumption models
an adversary who does not content themselves with finding
out whether the estimate and the unknown match, but wishes
to quantify how much they diverge.
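The contrast between a worst-case and an average-case criterion can be illustrated with a small sketch. It assumes, as one possible choice of discrepancy, the Kullback-Leibler divergence between the distribution of the confidential attribute within each group and its distribution in the entire table; the original t-closeness proposal may use a different distance. With this choice, the weighted average of the divergences is the mutual information between the group and the confidential attribute. Group weights are assumed strictly positive, and all names are hypothetical.

```python
import math

def kl_divergence(q, p):
    """D(q || p) in bits for two PMFs given as equal-length lists."""
    return sum(qi * math.log2(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def group_divergences(group_weights, group_pmfs):
    """Return the table-wide PMF and the divergence of each group from it."""
    size = len(group_pmfs[0])
    overall = [sum(w * pmf[j] for w, pmf in zip(group_weights, group_pmfs))
               for j in range(size)]
    return overall, [kl_divergence(pmf, overall) for pmf in group_pmfs]

def worst_case_leakage(group_weights, group_pmfs):
    """Largest divergence over all groups (worst case for the user)."""
    return max(group_divergences(group_weights, group_pmfs)[1])

def average_case_leakage(group_weights, group_pmfs):
    """Weighted mean divergence, i.e. the mutual information in bits."""
    divergences = group_divergences(group_weights, group_pmfs)[1]
    return sum(w * d for w, d in zip(group_weights, divergences))
```

Since a weighted mean never exceeds the maximum, `average_case_leakage` is at most `worst_case_leakage` for the same table, which is the sense in which the average-case criterion is looser.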

Another distinct privacy requirement is that of identity
disclosure, whereby a publisher wishes to protect the released
table against a linking attack. In this attack, the adversary's
aim is to uncover the identity of the individuals in the released
table by linking the records in this table to a public data
set including identifier attributes. Under this requirement and
under the assumption that the attacker regards each respondent
within a particular group as equally likely, k-anonymity may
be deemed a best-case measure of privacy, determined by
Hartley's entropy. We refer to this criterion as a best-case
metric precisely due to the naive assumption of a uniform
distribution of the identifier attribute. In other words, the
underlying adversarial model does not contemplate that an
attacker may have background knowledge that allows them
to consider certain users as more likely than others.
Finally, we may also regard the l-diversity criterion as a
best-case metric, since it assumes a uniform distribution of the
confidential attribute on a set of at least l values. Put another
way, this rudimentary adversarial model does not contemplate,
for example, the fact that certain values of the confidential
attribute may be semantically similar.
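As a simple illustration of these two best-case criteria, the sketch below computes k and l for a toy microdata table and the corresponding Hartley entropies. It assumes that records are already partitioned into groups sharing the same key-attribute values and uses the simplest "distinct values" reading of l-diversity; the data layout and names are hypothetical.

```python
import math
from collections import defaultdict

def anonymity_levels(records):
    """Return (k, ell) for records given as (group, confidential_value) pairs.

    k is the smallest group size; ell is the smallest number of distinct
    confidential values found within any group.
    """
    sizes = defaultdict(int)
    values = defaultdict(set)
    for group, value in records:
        sizes[group] += 1
        values[group].add(value)
    k = min(sizes.values())
    ell = min(len(v) for v in values.values())
    return k, ell

def hartley_bits(cardinality):
    """Hartley entropy, in bits, of a set with the given number of elements."""
    return math.log2(cardinality)

# Hypothetical table: two groups of respondents and their confidential attribute.
table = [("A", "flu"), ("A", "flu"), ("A", "cold"),
         ("B", "flu"), ("B", "hiv"), ("B", "cold"), ("B", "flu")]
k, ell = anonymity_levels(table)
```

Here `hartley_bits(k)` and `hartley_bits(ell)` are the privacy levels implied by the uniform assumptions discussed above; an attacker with background knowledge, or one who exploits semantically similar values, would face less uncertainty than these figures suggest.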

In the scenario of anonymous-communication systems, there 
exists a wide variety of approaches. Among them, a popular 
anonymous-communication protocol is Crowds. Although in 
this section we limit the discussion of the privacy provided 
by such systems to a variant of this protocol, we would like 
to stress that the conclusions drawn here may be extended to 
other anonymous systems. Having said this, recall that in the 
original Crowds protocol, a system designer makes available
to users a collaborative protocol that helps them enhance 
the anonymity of the messages sent to a common, untrusted 
Web server. The design parameters are the number of users 
participating in the protocol and the probability of forwarding 
a message directly to the server. 

In our variant of this protocol, however, we contemplate an
attacker who strives to guess the identity of the sender of a
given message, based on the knowledge of the last user in
the forwarding path. Under this adversarial model, we may
regard min-entropy, Shannon's entropy or Hartley's entropy
as particular cases of our measure of privacy, depending on
the specific strategy of the attacker. For example, under an
adversary who uses maximum a posteriori estimation and,
accordingly, opts for the last sender, min-entropy may be
interpreted as a worst-case privacy metric. Alternatively, we
may assume an attacker that considers the entire probability
distribution of possible senders, and not only the most likely
candidate. In this case, Shannon's entropy may be deemed
an average-case measure. Finally, under a rudimentary attacker
who takes into account just the number of potential originators
of the message, Hartley's entropy may be regarded as a
best-case measurement of privacy.





| Application domain | Worst case | Average case | Best case |
|---|---|---|---|
| Statistical disclosure control | t-closeness | mutual information | k-anonymity, l-diversity |
| Anonymous-communication systems | min-entropy | Shannon's entropy | Hartley's entropy |

TABLE IV: Guide for system designers. This table classifies several privacy
metrics depending, first, on whether they are regarded as worst-case,
average-case or best-case measures, and secondly on their application domain.



VIII. Conclusion 

A wide variety of privacy metrics have been proposed in the 
literature. Most of these metrics have been conceived for spe- 
cific applications, adversarial models, and privacy threats, and 
thus are difficult to generalize. Even for specific applications, 
we often find that various privacy metrics are available. For 
example, to measure the anonymity provided by anonymous- 
communication networks, several flavors of entropy (Shannon, 
Hartley, min-entropy) can be found in the literature, while 
no guidelines exist that explain the relationship between the 
different proposals, and provide an understanding of how to 
interpret and put in context the results provided by each of 
them. 

In the scenario of SDC, numerous approaches attempt to
capture, to a greater or lesser degree, the private information
leaked as a result of the dissemination of microdata sets. In
this spirit, k-anonymity is possibly the best-known privacy
measure, mainly due to its mathematical tractability. However,
numerous extensions and enhancements were introduced later
with the aim of overcoming its limitations. While all these
metrics have provided further insight into our understanding
of privacy, the research community would benefit from a
framework embracing all those metrics and making it possible
to compare them, and to evaluate any privacy-protecting
mechanism by the same yardstick.

In this work, we propose a privacy measure intended to 
tackle the above issues. Our approach starts with the definition 
and modeling of the variables of a general framework. Then, 
we proceed with a mathematical formulation of privacy, which 
essentially emerges from BDT. Specifically, we define privacy 
as the estimation error incurred by an attacker. We first propose 
what we refer to as conditional privacy, meaning that our 
measure is conditioned on an attacker's particular observation. 
Accordingly, we define the terms of average privacy and worst- 
case privacy. 

The formulation is then investigated theoretically. Namely,
we interpret a number of well-known privacy criteria as
particular cases of our more general metric. The arguments
behind these justifications are based on fundamental results
related to the fields of information theory, probability theory
and BDT. More accurately, we interpret our privacy criterion
as k-anonymity and l-diversity principles by connecting them
to Rényi's entropy and the concept of confidence set. Under
certain assumptions, a conditional version of the AEP allows
us to interpret Shannon's entropy as an arbitrarily high
confidence set. Then, the total variation distance and Pinsker's
inequality justify the t-closeness requirement and the criterion
proposed in [15] as particular instances of our measure of
privacy. In the course of this interpretation, we find that our
formulation bears a strong resemblance to the rate-distortion
problem in information theory.

Our survey of privacy metrics, our detailed analysis of their 
connection with information theory, and our mathematical 
unification as an attacker's estimation error, shed new light on 
the understanding of those metrics and their suitability when 
it comes to applying them to specific scenarios. In regard 
to this aspect, two sections are devoted to the classification 
of several privacy metrics, showing the relationships with 
our proposal and the correspondence with assumptions on 
the attacker's strategy. While the former section approaches 
this from a theoretical perspective, the latter is written as a 
guide to help system designers choose the appropriate metrics, 
without having to delve into the mathematical details. We also 
hope to illustrate the riveting intersection between the fields 
of information privacy and information theory, in an attempt 
towards bridging the gap between the respective communities. 

A couple of simple albeit insightful examples are also 
presented. Our first example quantifies the level of privacy 
provided by a privacy-enhancing mechanism that perturbs 
location information in the scenario of LBS. Under certain 
assumptions on the adversarial model, our measure of privacy 
becomes the mean squared error. Then we turn our attention 
to the scenario of anonymous-communication systems and 
measure the degree of anonymity achieved by a modification 
of the collaborative protocol Crowds. We contemplate different
strategies for the attacker and, accordingly, interpret
min-entropy, Shannon's entropy and Hartley's entropy as
worst-case, average-case and best-case privacy metrics.

In closing, we hope that this unified perspective of privacy 
metrics, drawing upon the principles of information theory 
and Bayesian estimation, is a helpful, illustrative step towards 
the systematic modeling of privacy-preserving information
systems. 

Acknowledgment 

This work was partly supported by the Spanish Government 
through projects Consolider Ingenio 2010 CSD2007-00004 
"ARES", TEC2010-20572-C02-02 "Consequence" and by the 
Government of Catalonia under grant 2009 SGR 1362. Ad- 
ditional sources of funding include IWT SBO SPION, GOA 
TENSE, the IAP Programme P6/26 BCRYPT, and the FWO
project "Contextual privacy and the proliferation of location 
data". D. Rebollo-Monedero is the recipient of a Juan de 
la Cierva postdoctoral fellowship, JCI-2009-05259, from the 
Spanish Ministry of Science and Innovation. C. Diaz is funded 
by an FWO postdoctoral grant. 

References 

[1] L. Willenborg and T. DeWaal, Elements of Statistical Disclosure Control.
New York: Springer-Verlag, 2001.

[2] T. B. Jabine, "Statistical disclosure limitation practices at United States
statistical agencies," J. Official Stat., vol. 9, no. 2, pp. 427-454, 1993.

[3] C. A. W. Citteur and L. C. R. J. Willenborg, "Public use microdata files:
Current practices at national statistical bureaus," J. Official Stat., vol. 9,
no. 4, pp. 783-794, 1993.

[4] J. Domingo-Ferrer and J. M. Mateo-Sanz, "Practical data-oriented
microaggregation for statistical disclosure control," IEEE Trans. Knowl.
Data Eng., vol. 14, no. 1, pp. 189-201, 2002.

[5] J. Domingo-Ferrer and V. Torra, "Ordinal, continuous and heterogeneous
k-anonymity through microaggregation," Data Min., Knowl. Disc.,
vol. 11, no. 2, pp. 195-212, 2005.

[6] A. Solanas, A. Martínez-Ballesté, and J. Domingo-Ferrer, "VMDAV:
A multivariate microaggregation with variable group size," in Proc.
Comput. Stat. (COMPSTAT). Rome, Italy: Springer-Verlag, 2006.

[7] D. Rebollo-Monedero, J. Forné, and M. Soriano, "Private location-based
information retrieval via k-anonymous clustering," in Proc. CNIT Int.
Workshop Digit. Commun., ser. Lecture Notes Comput. Sci. (LNCS).
Sardinia, Italy: Springer-Verlag, Sep. 2009, invited paper.

[8] L. Sweeney, "k-Anonymity: A model for protecting privacy," Int. J.
Uncertain., Fuzz., Knowl.-Based Syst., vol. 10, no. 5, pp. 557-570, 2002.

[9] P. Samarati, "Protecting respondents' identities in microdata release,"
IEEE Trans. Knowl. Data Eng., vol. 13, no. 6, pp. 1010-1027, 2001.

[10] T. M. Truta and B. Vinay, "Privacy protection: p-sensitive k-anonymity
property," in Proc. Int. Workshop Privacy Data Manage. (PDM), Atlanta,
GA, 2006, p. 94.

[11] A. Machanavajjhala, J. Gehrke, D. Kiefer, and M. Venkitasubramanian,
"l-Diversity: Privacy beyond k-anonymity," in Proc. IEEE Int. Conf.
Data Eng. (ICDE), Atlanta, GA, Apr. 2006, p. 24.

[12] N. Li, T. Li, and S. Venkatasubramanian, "t-Closeness: Privacy beyond
k-anonymity and l-diversity," in Proc. IEEE Int. Conf. Data Eng.
(ICDE), Istanbul, Turkey, Apr. 2007, pp. 106-115.

[13] J. Brickell and V. Shmatikov, "The cost of privacy: Destruction of
data-mining utility in anonymized data publishing," in Proc. ACM SIGKDD
Int. Conf. Knowl. Disc., Data Min. (KDD), Las Vegas, NV, Aug. 2008.

[14] C. Dwork, "Differential privacy," in Proc. Int. Colloq. Automata, Lang.,
Program. Springer-Verlag, 2006, pp. 1-12.

[15] D. Rebollo-Monedero, J. Forné, and J. Domingo-Ferrer, "From
t-closeness-like privacy to postrandomization via information theory,"
IEEE Trans. Knowl. Data Eng., vol. 22, no. 11, pp. 1623-1636, Nov.
2010. [Online]. Available:
http://doi.ieeecomputersociety.org/10.1109/TKDE.2009.190

[16] D. Chaum, "Untraceable electronic mail, return addresses, and digital
pseudonyms," Commun. ACM, vol. 24, no. 2, pp. 84-88, 1981.

[17] L. Cottrell, "Mixmaster and remailer attacks," 1994. [Online]. Available:
http://obscura.com/~loki/remailer/remailer-essay.html

[18] G. Danezis, R. Dingledine, and N. Mathewson, "Mixminion: Design of
a type III anonymous remailer protocol," in Proc. IEEE Symp. Security,
Privacy (SP), Berkeley, CA, May 2003, pp. 2-15.

[19] M. Duckham, K. Mason, J. Stell, and M. Worboys, "A formal approach
to imperfection in geographic information," Comput., Environ., Urban
Syst., vol. 25, no. 1, pp. 89-103, 2001.



[20] M. K. Reiter and A. D. Rubin, "Crowds: Anonymity for web
transactions," ACM Trans. Inform. Syst. Security, vol. 1, no. 1, pp. 66-92,
1998.

[21] O. Berthold, A. Pfitzmann, and R. Standtke, "The disadvantages of
free MIX routes and how to overcome them," in Proc. Design. Privacy
Enhanc. Technol.: Workshop Design Issues Anon., Unobser., ser. Lecture
Notes Comput. Sci. (LNCS). Berkeley, CA: Springer-Verlag, Jul. 2000,
pp. 30-45.

[22] A. Serjantov and G. Danezis, "Towards an information theoretic metric
for anonymity," in Proc. Workshop Privacy Enhanc. Technol. (PET), vol.
2482. Springer-Verlag, 2002, pp. 41-53.

[23] C. Diaz, S. Seys, J. Claessens, and B. Preneel, "Towards measuring
anonymity," in Proc. Workshop Privacy Enhanc. Technol. (PET), ser.
Lecture Notes Comput. Sci. (LNCS), vol. 2482. Springer-Verlag, Apr.
2002, pp. 54-68.

[24] G. Tóth, Z. Hornák, and F. Vajda, "Measuring anonymity revisited," in
Proc. Nordic Workshop Secure IT Syst., Nov. 2004, pp. 85-90.

[25] S. Clauß and S. Schiffner, "Structuring anonymity metrics," in Proc.
ACM Workshop on Digit. Identity Manage. (DIM). Fairfax, VA: ACM,
Nov. 2006, pp. 55-62.

[26] P. Syverson and S. Stubblebine, "Group principals and the formalization
of anonymity," in Proc. World Congr. Formal Methods, 1999,
pp. 814-833.

[27] S. Mauw, J. Verschuren, and E. P. de Vink, "A formalization of
anonymity and onion routing," in Proc. European Symp. Res. Comput.
Security (ESORICS), vol. 3193. Lecture Notes Comput. Sci. (LNCS),
2004, pp. 109-124.

[28] J. Feigenbaum, A. Johnson, and P. Syverson, "A model of onion routing
with provable anonymity," in Proc. Financ. Cryptogr., Data Security
(FC). Springer-Verlag, 2007.

[29] M. Edman, F. Sivrikaya, and B. Yener, "A combinatorial approach to
measuring anonymity," IEEE J. Intell., Security Inform., pp. 356-363,
2007.

[30] B. Gierlichs, C. Troncoso, C. Diaz, B. Preneel, and I. Verbauwhede,
"Revisiting a combinatorial approach toward measuring anonymity," in
Proc. ACM Workshop on Privacy in the Electron. Society. ACM, 2008,
pp. 111-116.

[31] R. Bagai, H. Lu, R. Li, and B. Tang, "An accurate system-wide
anonymity metric for probabilistic attacks," in Proc. Workshop Privacy
Enhanc. Technol. (PET), ser. Lecture Notes Comput. Sci. (LNCS), vol.
6794. Springer-Verlag, 2011, pp. 117-133.

[32] R. Shokri, J. Freudiger, M. Jadliwala, and J. P. Hubaux, "A
distortion-based metric for location privacy," in Proc. ACM Workshop on Privacy
in the Electron. Society, 2009.

[33] R. Shokri, G. Theodorakopoulos, J. Y. L. Boudec, and J. P. Hubaux,
"Quantifying location privacy," in Proc. IEEE Symp. Security, Privacy
(SP). Washington, DC, USA: IEEE Comput. Soc., 2011, pp. 247-262.

[34] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed.
New York: Wiley, 2006.

[35] J. O. Berger, Statistical Decision Theory and Bayesian Analysis. New
York: Springer-Verlag, 1985.

[36] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed.
New York: Wiley, 2001.

[37] P. H. Algoet and T. M. Cover, "A sandwich proof of the
Shannon-McMillan-Breiman theorem," Annals Prob., vol. 16, no. 2, pp. 899-909,
1988.

[38] C. E. Shannon, "Coding theorems for a discrete source with a fidelity
criterion," in IRE Nat. Conv. Rec., vol. 7 Part 4, 1959, pp. 142-163.

[39] D. B. Reid, "An algorithm for tracking multiple targets," IEEE Trans.
Autom. Control, vol. 24, no. 6, pp. 843-854, 1979.

[40] W. K. Hastings, "Monte Carlo sampling methods using Markov chains
and their applications," Biometrika, vol. 57, no. 1, pp. 97-109, 1970.



